Bias-Mitigating News Search with BiasRank

Tim Menzner¹ and Jochen L. Leidner¹,²

¹ Information Access Research Group, Center for Responsible Artificial Intelligence Research (CRAI), Coburg University of Applied Sciences, Coburg, Germany

² Department of Computer Science, University of Sheffield, Sheffield, U.K.

Keywords:

News Bias Detection, Information Retrieval, Bias-Aware Ranking, Large Language Models, News

Recommendation Systems, Search Result Fairness, Re-Ranking.

Abstract:

As geopolitical adversaries as well as internal commercial and political actors target democracies with dis-

information campaigns, it is increasingly necessary to ﬁlter out biased reporting. Some automatic success

has recently been achieved in this task. For further progress, web search engines need to implement news

bias resistance mechanisms for ranking news stories. To this end, we present BiasRank, a new approach that

demotes articles exhibiting news media bias by combining a large neural language model for news bias classi-

ﬁcation with a heuristic re-ranker. Our experiments, based on artiﬁcially polluting a (mostly neutral) standard

news corpus with various degrees of biased news stories (biased to varying extents), inspired by earlier work

on answer injection, demonstrate the effectiveness of the approach. Our evaluation shows that the method

radically reduces news bias at a negligible cost in terms of relevance. In turn, we also provide new metrics

for the evaluation of similar systems that aim to balance two variables (like relevancy and bias in our case).

Additionally, we release our test collection on git to support further research on de-biasing news search.

1 INTRODUCTION

Web search engines, such as Google, Baidu, Qwant,

Yandex, DuckDuckGo and others, as well as news

recommender engines, such as Google News, are

powerful tools for seeking speciﬁc information as

well as for getting news stories. However, it has been

shown that these systems suffer from various types

of bias (Gharahighehi et al., 2021; Wendelin et al.,

2017), and given the pervasiveness of Web search in

our lives, there is a looming threat of manipulating

online audiences for political or monetary gain. To

help counter this issue, in this paper we explore ap-

proaches for reducing media bias in the ranking of

news stories. Speciﬁcally, we address the following

research question:

Research Question (RQ): How can we achieve less

biased rankings in a news search or news recommen-

dation context?

The main contributions of this work are as fol-

lows:

ORCID (Tim Menzner): https://orcid.org/0009-0005-9753-9364

ORCID (Jochen L. Leidner): https://orcid.org/0000-0002-1219-4696

• We describe BiasRank, a new hybrid method for

ranking news stories, promoting objective news

reports and demoting individual stories and web-

sites that suffer from media bias;

• We propose a dynamic method to update the in-

dex with LLM-generated information only when

a document is requested, minimizing the need for

frequent and costly LLM calls;

• We outline a set of metrics to measure bias in

query results and its (or any other metric’s) trade-

off with result relevance;

• We present an empirical evaluation that demon-

strates the efﬁcacy of BiasRank on a news corpus;

• We release a demo of our system, as well as our

test collection, to the public in order to foster more

discussion and encourage future work;

To analyze ranking relevance and bias together, and to explore the trade-offs in doing so, we need five ingredients: 1. a data collection (we combine a news

corpus with injected known-bias stories), 2. a set

of queries (we created a set), 3. a set of relevance

judgments (QRELs, we created judgments for two re-

trieval methods for the top-40 for our topics), 4. a set

of bias assignments for retrieved documents (we use

436

Menzner, T. and Leidner, J. L.

Bias-Mitigating News Search with BiasRank.

DOI: 10.5220/0013755200004000

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2025) - Volume 1: KDIR, pages 436-447

both a lexicon-based baseline and a state-of-the-art, custom fine-tuned, 27-class news bias neural transformer model) and 5. an evaluation metric that combines relevance and bias (a new one is proposed below).

Note that we address a different problem from

(Joachims, 2002) or (Craswell et al., 2008), who both

address positional bias irrespective of the content of

documents at particular positions, whereas we address news bias, i.e. the neutrality (or not) of documents

together with their ranks. We are not aware of any

prior work on true news bias re-ranking, but (Ye and

Skiena, 2019), who approximate news bias analysis

with sentiment analysis, come closest among prior

approaches.

Figure 1: Search results for a query in BiasRank before

(top) and after (bottom left and bottom right) re-ranking

based on two different settings.

2 RELATED WORK

Bias, propaganda and disinformation in media have

been widely studied (Herman and Chomsky, 1988;

Lippmann, 1922). Building on this foundational

work, recent studies have empirically established the

presence of bias in both Web search and social media

(Bakshy et al., 2015; Gezici et al., 2021). In the

following sections, we review work on various bias

types and related ﬁelds relevant to news search.

2.1 Media Content Bias

Some biases may stem directly from the agenda of an

owner or other decision makers, others might be ar-

tifacts of the way systems have been built and how

models have been trained. Increasingly there is also

“customer orientation” or “bias by demand”, i.e. jour-

nalists write about things news consumers care about

and click on based on their biases, which is its own se-

lection bias, or even clickbait (Wendelin et al., 2017).

Lauw, Lim and Wang (Lauw et al., 2006) argue

that bias and controversy are connected, and should

therefore be analysed together: the same degree of

bias observation at the surface may count for more if

observed for a less controversial topic.

Fine-grained models and systems for the detection

of propaganda (Da San Martino et al., 2019) and me-

dia bias (Menzner and Leidner, 2024b; Menzner and

Leidner, 2024c) have been proposed, based on ma-

chine learning methods, where propaganda denotes

voluntary inﬂuencing for political gain whereas me-

dia bias is a broader concept that includes propaganda

and also involuntary distortions. These fine-grained detection methods inspire our choice of a neural classifier in BiasRank, as they open up the way for a reliable, automated detection of document bias based on

its actual content.

2.2 Gender Bias in Text & Search

Gender bias in text and in search has received sub-

stantial attention in its own right (e.g., (Costa-Jussà,

2019)). This line of work is partly motivated by gen-

der stereotypes that emerge through machine learning

when translating between languages. In some lan-

guages, the grammatical gender reﬂects the biologi-

cal sex of the person holding a profession, whereas

in others it does not. For example, nurse in English

is gender-neutral, while Krankenschwester in Ger-

man refers only to female nurses. Ratz, Schedl and

Kopeinik (Ratz et al., 2024) look at gender bias and

evaluate their own bias metric for it against past work

on a recent collection of bias-sensitive topics and doc-

uments from MS MARCO data.

2.3 Political Bias on the Web & Social

Media

(Kulshrestha et al., 2019) propose a framework to

quantify political bias in social media search re-

sults by disentangling bias introduced by input data

from that introduced by the ranking system, and,

through empirical analysis of Twitter queries during

the 2016 US presidential primaries, they ﬁnd that both


sources signiﬁcantly shape the political bias observed

in search results.

A study by Epstein and Robertson (Epstein and

Robertson, 2015) investigated what they called "the

search engine manipulation effect", ﬁnding that bi-

ased search rankings can indeed shift the voting pref-

erences of undecided voters.

2.4 Bias in Rankings

(Gharahighehi et al., 2021) address the problem of

popularity bias (“rich get richer”) in rankings.

(Ovaisi et al., 2020) consider position bias and

selection bias in rankings in recommender engines

that use learning to rank. The authors adapt a bias

correction method from the older statistical literature

to the recommendation ranking scenario and demon-

strate superior accuracy compared to unbiased rank-

ings. Crucially, their method does not inspect the ac-

tual documents at each rank.

Fairness using protected attribute labels has been

the topic of the Fair Ranking Track shared task at

US NIST’s Text REtrieval Conferences (TREC) (Ek-

strand et al., 2022), which targeted fair exposure

of individual attributes or groups of them, based

on Wikipedia documents. Raj and Ekstrand com-

pared different evaluation metrics for fair ranking and

found the Attention-Weighted Rank Fairness (AWRF)

(Sapiezynski et al., 2019; Raj and Ekstrand, 2022) to

be the most generally useful metric for single rank-

ings with its adaptability to different models, target

distributions, and difference functions (Raj and Ek-

strand, 2022).

In the Fair Ranking Track, the product of AWRF and nDCG (Järvelin and Kekäläinen, 2002) is formed to give relevance and fairness the same weight in the evaluation of sub-task 1 at that shared task. (Dai et al., 2024) provide a survey of the various challenges around bias in IR. Note that the extensive body

of work on statistical distortions in search results (bi-

ased rankings) is different from the topic of this paper

(biased-language news rankings).

2.5 Previous Attempted Remedies

Re-ordering the items in a ranking because item positions may have incurred a bias as per their position alone is different from inspecting the textual content of each item and estimating a content bias score, which is what we propose here.

(Jaenich et al., 2024) describe adaptive re-ranking methods that aim to increase the visibility of relevant but underrepresented groups in the re-ranking phase of a two-stage retrieval process comprising document ranking and re-ranking. In the context of news recommendation, the technical report by (Wu et al., 2022) describes a fairness-aware ranking approach that models users’ interest via user embeddings, obtained via adversarial learning, also from click data.

In contrast to these models, our heuristic approach

is not only simpler, it can also be implemented in set-

tings where click data is unavailable.

(Park et al., 2012) present NewsCube, an aspect-

oriented news browser prototype; by presenting mul-

tiple aspects of each news story they aim to mitigate

news bias.

The advantage of this approach is that

it avoids automatic censorship, intentional or other-

wise. But the approach implies that users are actually

interested in investigating a broad range of alternative

viewpoints.

(Hu et al., 2019) study political partisanship bias

in the snippets of the Google Web search’s SERP us-

ing a lexicon approach.

Different design choices for bias-aware web

searches were investigated by (Paramita et al., 2022).

Even though their prototype was a mock-up that did

not actually assess document bias, their ﬁndings con-

ﬁrm the utility of a re-ranking approach for such sys-

tems.

Perhaps closest in spirit to our approach is the

work of Ye and Skiena (Ye and Skiena, 2019), who

describe MediaRank, a method and Website that ranks

>50,000 media Websites based on the factors peer

reputation (where number of citations is taken to be

a proxy for reputation), reporting bias/breadth (where

sentiment differences of a large set of left-wing and

right-wing individuals towards them is used as a

proxy), bottom-line ﬁnancial pressure (using bot and

ad activity as a proxy) and popularity (using Alexa

rank as a proxy). Unlike these statistical corrections, we directly analyze content to detect linguistic bias

in news items; our system also focuses on news bias,

which is estimated directly by a custom model rather

than by using a proxy.

3 METHOD

Our goal is to take into account the content bias of

all individual documents in a collection, but in a way

that only minimally impacts the typical IR indexing

and retrieval pipeline. We also aim to facilitate implementation of our method as part of existing legacy indexing and retrieval pipelines that may be hard to change, but which we wish to enrich with our method for adding resilience in the face of news bias (Figure 2).

At the time of writing, the system is no longer available on the Internet.

KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval

We propose a heuristic search function that combines a relevance model and an anti-bias model to calculate a ranking score for a document $d$ and a query $q$, based on a score $\mathrm{rel}_{\mathrm{document}}(q, d)$ indicating the relevance of $d$ for $q$, as well as a score $\mathrm{bias}_{\mathrm{document}}(d)$ indicating how biased the content of the document is, through simple linear interpolation as follows:

$$
\mathrm{BiasRank}(q, d) = (1 - \lambda) \cdot \mathrm{rel}_{\mathrm{document}}(q, d) + \lambda \cdot \bigl(1 - \mathrm{bias}_{\mathrm{document}}(d)\bigr) \tag{1}
$$

where the linear interpolation weight λ, with 0 ≤ λ ≤ 1, controls the degree of influence of the bias score on the overall score. Setting λ = 0 switches the bias-based re-ranking off, so that BiasRank behaves like a pure relevance ranker, whereas λ = 1 makes the relevance term disappear, so that BiasRank degenerates into a search for the least biased documents, not taking any relevance into account at all.
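The interpolation in Equation 1 can be illustrated with a minimal sketch. The function names, the toy hits and their scores are hypothetical and only serve to show how λ shifts the ordering; relevance and bias are both assumed to be already scaled to [0, 1].

```python
from typing import List, Tuple


def biasrank_score(rel: float, bias: float, lam: float) -> float:
    """Linear interpolation of relevance and (1 - bias), both assumed in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * rel + lam * (1.0 - bias)


def rerank(hits: List[Tuple[str, float, float]], lam: float) -> List[str]:
    """Sort (doc_id, relevance, bias) triples by descending BiasRank score."""
    ordered = sorted(
        hits, key=lambda h: biasrank_score(h[1], h[2], lam), reverse=True
    )
    return [doc_id for doc_id, _, _ in ordered]


# Toy data (invented): "a" is most relevant but strongly biased.
hits = [("a", 1.0, 0.9), ("b", 0.8, 0.1), ("c", 0.5, 0.0)]
order_relevance_only = rerank(hits, 0.0)  # pure relevance ranking
order_balanced = rerank(hits, 0.5)        # equal weight on both terms
order_bias_only = rerank(hits, 1.0)       # least biased first
```

With these toy values, the biased but highly relevant document drops to the last position as soon as the bias term receives half of the weight.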

To assess a document’s bias, we use BiasScan-

ner (Menzner and Leidner, 2025; Menzner and Lei-

dner, 2024a; Menzner and Leidner, 2024c), a large

custom language model (LLM) ﬁne-tuned with train-

ing data comprising biased sentences from news arti-

cles, annotated with bias type and intensity, to iden-

tify all biased sentences and determine the intensity

(bias strength on a scale of 0 to 1) of the bias in each

individual, biased sentence. For this experiment, we

opted for the GPT-3.5 variant of BiasScanner. Let a

document have N sentences in total, of which n are

classiﬁed as biased with a respective intensity b. We

can then use this information to calculate the over-

all bias of a document: pervasiveness as the pro-

portion of biased sentences and strength as the mean

bias intensity of these sentences. By combining these

two measures, we obtain a single overall bias score

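$\mathrm{bias}_{\mathrm{document}}$, which reflects both the extent and intensity of bias within the document:

$$
\mathrm{pervasiveness} = \frac{n}{N}, \qquad \mathrm{strength} = \frac{1}{n} \sum_{i=1}^{n} b_i, \qquad \mathrm{bias}_{\mathrm{document}} = \frac{\mathrm{pervasiveness} + \mathrm{strength}}{2} \tag{2}
$$

where $b_i$ is the intensity of the $i$-th biased sentence.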
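As a rough sketch of the document-level score, the following function combines pervasiveness and strength. It assumes that the two measures are averaged, so that the result stays within [0, 1] as the text requires, and that a document without biased sentences receives a score of 0; both points, like the function name, are assumptions made for this illustration.

```python
from typing import Sequence


def document_bias(num_sentences: int, intensities: Sequence[float]) -> float:
    """Combine pervasiveness (n / N) and strength (mean intensity).

    num_sentences: N, the total number of sentences in the document.
    intensities:   one value in [0, 1] per sentence classified as biased.
    """
    n = len(intensities)
    if n > num_sentences:
        raise ValueError("more biased sentences than sentences in total")
    if num_sentences <= 0 or n == 0:
        return 0.0  # assumption: no biased sentence means zero bias
    pervasiveness = n / num_sentences
    strength = sum(intensities) / n
    # Assumption: the two measures are averaged to stay within [0, 1].
    return (pervasiveness + strength) / 2.0


# Invented example: 2 of 10 sentences are biased, with intensities 0.5 and 0.7.
example_score = document_bias(10, [0.5, 0.7])
```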

4 IMPLEMENTATION

We implemented our score as a re-ranking procedure on top of length-normalized TFIDF and BM25 scores as provided by the Apache Lucene search library (Andrzej Białecki, 2012), to compare performance across standard relevance models. Our system is based on the PyLucene (9.7.0) wrapper. The architecture of the system is shown in Figure 2.

Figure 2: BiasRank Architectural Overview.

1. When a query retrieves a document from the in-

dex, the de-biasing ranker ﬁrst checks whether

the document already includes the ﬁeld we use to

specify its bias information.

2. If this ﬁeld is absent, the ranker then determines if

the document’s bias has already been cached in a

Redis (Sanfilippo, 2009) database.

3. If the bias is not cached, the document is for-

warded to a component that evaluates its bias. By

default, we employ the BiasScanner model for

this assessment, though it can be easily replaced

with any comparable method, as long as it returns

a score between 0 and 1 for each document.

4. Once the bias score is generated, it is stored in the

cache, and the bias ﬁeld is added to the document

in the index.

By only rating documents that actually appear in our

queries and storing the results, we optimize our sys-

tem’s efﬁciency and reduce unnecessary computation

as well as cost and energy consumption, ensuring

that subsequent queries can retrieve bias information

quickly and accurately without redundant processing.
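The four steps above can be sketched as follows. Plain dictionaries stand in for the Lucene index and the Redis cache, and the field name, function names and fake scorer are hypothetical; writing the field back on a cache hit is also an assumption of this sketch rather than something the text specifies.

```python
from typing import Callable, Dict, List, MutableMapping

BIAS_FIELD = "bias"  # hypothetical field name; the text does not name it


def lookup_bias(
    doc_id: str,
    index: Dict[str, dict],
    cache: MutableMapping[str, float],
    score_bias: Callable[[str], float],
) -> float:
    """Return a document's bias score, calling the scorer only when needed."""
    doc = index[doc_id]
    if BIAS_FIELD in doc:              # step 1: field already in the index
        return doc[BIAS_FIELD]
    if doc_id in cache:                # step 2: score already cached
        bias = cache[doc_id]
    else:                              # step 3: run the (costly) scorer
        bias = score_bias(doc["text"])
        if not 0.0 <= bias <= 1.0:
            raise ValueError("bias scorer must return a value in [0, 1]")
        cache[doc_id] = bias           # step 4: store in the cache ...
    doc[BIAS_FIELD] = bias             # ... and add the field to the index
    return bias


scorer_calls: List[str] = []


def fake_scorer(text: str) -> float:
    """Stand-in for a real bias model; records how often it is called."""
    scorer_calls.append(text)
    return 0.25


index = {"d1": {"text": "Some news story."}}
cache: Dict[str, float] = {}
first = lookup_bias("d1", index, cache, fake_scorer)
second = lookup_bias("d1", index, cache, fake_scorer)  # served from the index
```

In a deployment, the dictionary lookups would be replaced by calls to the index and to the cache client; the control flow stays the same.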

https://lucene.apache.org/pylucene/

We re-scale all Lucene relevance scores to [0, 1] using the minimum and maximum Lucene score returned among all n results of the query, with a linear normalization; our bias score assigned to a document is always between 0 and 1 by definition. In practice, the lower bound of 0 occurs fairly frequently, whereas the upper bound of 1 is rarely reached, because it would require a document made up entirely of biased sentences, without any generic filler text, each of which would need to be extremely biased.
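The linear re-scaling of relevance scores amounts to a per-query min-max normalization, sketched below. The function name is hypothetical, and the handling of the degenerate case in which all scores are equal is an assumption, since the text does not cover it.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Linearly map the scores of one result list onto [0, 1]."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: if all hits score the same, treat them as equally relevant.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


normalized = min_max_normalize([2.0, 4.0, 6.0])
```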

5 EVALUATION

5.1 Retrieval Setup

Our evaluation protocol is inspired by answer injec-

tion (Leidner and Callison-Burch, 2003), a method

to evaluate question answering systems by planting

known answers in large background corpora in a way

so as to remember where (in which document ID)

the correct answer to any one particular question was

to be found. Following this protocol, we ﬁrst create

an artificially polluted corpus from an assumed-neutral

background corpus by planting news stories known to

us to be biased inside the background corpus of news,

which can be expected to be mostly unbiased; we call

this enriched corpus the “polluted corpus”.

5.2 Collection

We utilize the Reuters TRC2 English sub-corpus as

our background corpus: for our experiments, we as-

sume the great majority of Reuters stories to be un-

biased

and “pollute” it with news stories known to

be biased. While the notion of a completely unbiased

news agency is likely an unattainable standard, not

least because individual deﬁnitions of bias may vary

depending on perspective, Reuters is often considered

one of the news agencies that come closest to achiev-

ing this goal (Ad Fontes Media (eds.), 2024; Budak

et al., 2016).

The TRC2 collection is a collection of news re-

ports from the Reuters news agency (owned by the

Thomson Reuters Corporation and distributed by US

NIST for research purposes) for English. TRC2 was

designed originally to be time-aligned with another

corpus covering blogs (BLOG09), so it contains news

from 14 months starting with the year 2008.

The biased news stories injected in the corpus

were manually collected by searching for biased ar-

ticles on the Fox News website that addressed topics

reported in TRC2 during the given time period. Fox News was selected as a source due to its convenient search function, which allows for easy access to articles matching the relevant time-frame and keywords. Additionally, its well-documented right-wing bias (Martin and Yurukoglu, 2017; Bernhardt et al., 2020) facilitates the identification of articles that exhibit bias while covering pertinent topics.

We counted 645 opinion pieces in the TRC2 dataset among 1,312,775 documents (< 0.05%).

The limitation of gathering biased articles from

only one side of the bias spectrum does not impede

our experiment, as the bias score calculation we rely

on is agnostic to the direction of bias, whether right-

wing, left-wing, or otherwise. As long as a document

is biased, it is likely to yield a high bias score.

Besides time-frame, the specific topics were also chosen based on the likely contentiousness of the event, because we want to have a realistic likelihood that biased (as well as unbiased) stories about these topics are retrieved; if a topic is not somewhat controversial, there may not be enough data in the intersection set between relevant and biased stories.

Overall, 85 biased articles covering 17 differ-

ent topics including “same-sex marriage”, the “auto

bailout” and “Obama’s presidential campaign and

victory” were picked and injected.

5.3 Queries

We constructed a set of 40 queries (often called “topics” in IR) based on the 17 different topics identified for Section 5.2. We ensured that the queries themselves were not inherently biased, avoiding the explicit request for a Fox News article or the use of

loaded terms. Instead, we formulated queries that one

might use when genuinely seeking information about

a topic without pre-existing bias (e.g., “obama stim-

ulus package” rather than “obama socialist stimulus

fox news”).

5.4 Relevance Judgments

The two co-authors annotated a set of documents for relevance with respect to the 40 queries. Given top-k retrieval with k = 40, there are at most 40 queries × 40 results × 2 IR methods = 3,200 QRELs to produce; in practice, however, there is substantial overlap between the documents retrieved by the vector space model with TFIDF weighting and those retrieved by the binary probabilistic model with BM25 weighting. We divided the data into three groups, one per annotator for single annotation and a smaller partition with N = 100 doubly annotated records to be able to determine inter-annotator agreement. We annotated the JSON representation of the QREL tuples, which included the question and the title as well as the first 512 characters of the document, directly in a text editor. Each document was categorized as either “relevant” or “not relevant”. For an N = 8 query sample and k = 40 top-k retrieval results, we constructed QRELs with two raters; the resulting inter-annotator agreement observed was 95.55% (raw overlap) and 0.91 (in terms of Cohen's κ), which can be described as almost perfect agreement, bolstering confidence in the quality of annotations.
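For readers unfamiliar with the two agreement figures, the sketch below computes raw overlap and Cohen's κ for two raters. The label lists are invented toy data, not the annotations described above, and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def raw_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which both raters chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    observed = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    n = len(a)
    # Expected agreement if both raters labelled independently at their own rates.
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented toy labels: 1 = relevant, 0 = not relevant.
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 1, 0, 1]
toy_overlap = raw_agreement(rater_1, rater_2)
toy_kappa = cohen_kappa(rater_1, rater_2)
```

The toy example shows why κ is reported alongside raw overlap: three of four labels match, but half of that agreement would be expected by chance.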

5.5 Baseline

To enable the assessment of the impact of the news

bias model, we also implemented a simple lexicon

baseline method that works as follows: a small set

of terms are looked up from a hashtable and each sen-

tence with at least one match encountered while going

through a news story increases a counter. The overall

bias score of an article is then calculated by dividing this counter by the total number of sentences.

Biaslex-baseline-IPM21 uses the list of 76 En-

glish bias indicator terms from Spinde et al. (Spinde

et al., 2021) whereas Biaslex-baseline-KDIR2025

uses our own list of 48 bias terms made up from intro-

spection and browsing the Web for resources explain-

ing for human readers how to identify news biases, as

well as terms obtained by prompting ChatGPT-4o to

output the 50 terms most strongly indicative of news

bias.
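A minimal sketch of such a lexicon baseline is given below. The sentence splitter is a naive regular expression and the two example terms are invented for illustration; they are not taken from either of the two lexica described above.

```python
import re
from typing import Set


def lexicon_bias(text: str, lexicon: Set[str]) -> float:
    """Share of sentences that contain at least one lexicon term."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    matches = 0
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if tokens & lexicon:  # set lookup, i.e. a hash-table membership test
            matches += 1
    return matches / len(sentences)


toy_lexicon = {"radical", "disaster"}  # hypothetical terms
toy_text = (
    "The radical plan failed. Markets rose on Tuesday. "
    "Critics called it a disaster."
)
toy_lexicon_score = lexicon_bias(toy_text, toy_lexicon)
```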

5.6 Evaluation Metrics

We evaluate several retrieval methods against our rel-

evance judgments before and after re-ranking with

our method. To assess the overall bias in a set of

n search results for a given query, we calculate the

sum of the bias scores assigned to each document, ap-

plying a logarithmic weighting based on its position

in the ranking. This approach gives greater weight

to higher-ranking documents, ensuring they have a

larger inﬂuence on the overall bias, as we consider

the top results to be the most signiﬁcant in shaping

the overall perception of the query.

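$$
\mathrm{bias}_{\mathrm{results}} = \sum_{i=1}^{n} \frac{\mathrm{bias}_{\mathrm{document}}(d_i)}{\log(i + 1)} \tag{3}
$$

where $d_i$ is the document at rank $i$.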
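The position-weighted sum can be sketched as follows, together with the normalization by the worst possible ordering that the text describes for comparability with NDCG. The base-2 logarithm is an assumption borrowed from DCG; the base cancels out in the normalized value. Function names are hypothetical.

```python
import math
from typing import Sequence


def weighted_bias(biases: Sequence[float]) -> float:
    """Sum of bias scores, discounted by log2(rank + 1), ranks starting at 1."""
    return sum(b / math.log2(rank + 1) for rank, b in enumerate(biases, start=1))


def normalized_bias(biases: Sequence[float]) -> float:
    """Weighted bias divided by its maximum over all orderings of the same set."""
    worst = weighted_bias(sorted(biases, reverse=True))  # most biased first
    return weighted_bias(biases) / worst if worst > 0 else 0.0


# Invented example: the same two documents in both possible orders.
biased_first = normalized_bias([1.0, 0.0])
biased_last = normalized_bias([0.0, 1.0])
```

Moving the biased document from rank 1 to rank 2 lowers the normalized value, which is the effect the re-ranker is meant to achieve.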

Both lexica as well as the URLs of the injected articles and all queries with the corresponding qrels can be found online (https://github.com/Timperator2/BiasRankReproducibility).

To measure relevance of a set, we decided to use Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002) as provided by trec_eval, because it accounts for document position in a manner similar to our bias calculation, thereby enhancing comparability. Since NDCG also relies on normalization with the maximum possible DCG, we normalized $\mathrm{bias}_{\mathrm{results}}$ using the maximum possible bias of the given set ($\mathrm{bias}_{\mathrm{results}}$ when sorted in descending order of bias). However, our overarching interest is whether bias is reduced in a way that does not, or not substantially, affect relevance negatively. To this end, we define a combined metric, the Linear Re-ranking Impact Score (LRIS), based on the deltas of $\mathrm{bias}_{\mathrm{results}}$ and $\mathrm{relevance}_{\mathrm{results}}$ in percent before and after the re-ranking:

$$
\mathrm{LRIS} = -1 \cdot \Delta \mathrm{bias}_{\mathrm{results}} + \Delta \mathrm{relevance}_{\mathrm{results}} \tag{4}
$$

When the decrease in bias after re-ranking is larger than the decrease in relevance, the LRIS will be positive. If relevance decreases more than bias, it will be negative. Besides this linear trade-off metric, we also calculate the delta in an Adapted Harmonic Mean (AHM) between bias and relevance of the set before and after re-ranking (similar to how an F-score combines Precision and Recall). This Non-Linear Re-ranking Impact Score (NRIS) is more sensitive to small (absolute) improvements when relevance or bias values are low.
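As a worked example of the linear score, the sketch below applies the LRIS definition to deltas expressed as fractions. The input values are the ∆R and ∆B entries reported for BM25 with λ = 0.25 on the full result set in Table 2; the function name is hypothetical.

```python
def lris(delta_bias: float, delta_relevance: float) -> float:
    """Linear Re-ranking Impact Score: -1 * delta_bias + delta_relevance."""
    return -1.0 * delta_bias + delta_relevance


# Table 2, BM25, lambda = 0.25, full set: bias -9.5%, relevance -2.5%.
lris_example = lris(delta_bias=-0.095, delta_relevance=-0.025)

# A re-ranking that costs more relevance than it removes bias scores negative.
lris_negative = lris(delta_bias=-0.02, delta_relevance=-0.10)
```

The first value agrees, up to rounding, with the LRIS of 0.070 listed in the same table row.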

We conducted a second evaluation focusing solely on the top-k results out of our n, without applying position-based weighting within this window. In this case, relevance and bias scores for the k out of n results may change due to back-filling, as documents in the top-k can be replaced by others with different relevance or bias levels through the re-ranking. To remove position-based weighting, we replaced NDCG with Precision for calculating $\mathrm{relevance}_{\mathrm{top}\text{-}k}$. To ensure comparability, $\mathrm{bias}_{\mathrm{top}\text{-}k}$ was also calculated in a precision-like manner in this round, representing the proportion of documents in the top-k with a bias score greater than zero. We chose k = 10, as this is also the standard number of results shown on the first page of many search engines.
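The two unweighted top-k measures can be sketched as below. Function names and example values are hypothetical; dividing by k even when fewer than k results are available is an assumption of this sketch.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Share of the top-k results judged relevant (no position weighting)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevant[:k] if r) / k


def biased_share_at_k(bias_scores: Sequence[float], k: int) -> float:
    """Share of the top-k results with a bias score greater than zero."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for b in bias_scores[:k] if b > 0.0) / k


# Invented ranking of four documents, evaluated with k = 4.
toy_precision = precision_at_k([True, False, True, True], k=4)
toy_biased_share = biased_share_at_k([0.0, 0.3, 0.0, 0.9], k=4)
```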

LRIS and NRIS rely on effective bias-scoring methods. A system that assigns high bias scores to unbiased documents may still appear to perform well by demoting these misclassified documents, while failing to address any true bias it cannot measure. To address this, we use a second version of LRIS, the Injection-based Linear Re-ranking Impact Score (ILRIS). Under our premise that the injected documents are biased and the TRC2 documents are neutral, we assign bias values of 1 and 0, respectively. We then calculate ILRIS like we would LRIS to assess the effects of the re-ranking, which is still done using the scores of the respective bias-scoring method.

The AHM and NRIS introduced above are defined as:

$$
\mathrm{AHM} = \frac{2 \cdot \mathrm{relevance}_{\mathrm{results}} \cdot (1 - \mathrm{bias}_{\mathrm{results}})}{\mathrm{relevance}_{\mathrm{results}} + (1 - \mathrm{bias}_{\mathrm{results}})}, \qquad \mathrm{NRIS} = \Delta \mathrm{AHM} \tag{5}
$$
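The harmonic-mean variant can be sketched in the same spirit. The function names and the example values are hypothetical, and returning 0 when both relevance and neutrality are zero is an assumption, as the formula is undefined in that case.

```python
def ahm(relevance: float, bias: float) -> float:
    """Adapted Harmonic Mean of relevance and neutrality (1 - bias)."""
    neutrality = 1.0 - bias
    denominator = relevance + neutrality
    if denominator <= 0.0:
        return 0.0  # assumption: undefined case mapped to zero
    return 2.0 * relevance * neutrality / denominator


def nris(rel_before: float, bias_before: float,
         rel_after: float, bias_after: float) -> float:
    """Non-Linear Re-ranking Impact Score: AHM after minus AHM before."""
    return ahm(rel_after, bias_after) - ahm(rel_before, bias_before)


# Invented example: bias is removed completely at no cost in relevance.
nris_example = nris(rel_before=0.5, bias_before=0.5, rel_after=0.5, bias_after=0.0)
```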

5.7 Results

5.7.1 Bias-Scoring Methods

Table 1 compares the BiasScanner method for deter-

mining the bias of news articles with the two base-

lines described in Section 5.5 and a third baseline in

which bias values are assigned randomly as numbers

between 0 and 1. For each retrieval method (BM25

and TFIDF), the same 40 queries with 40 hits were

used, resulting in 1257 unique documents for BM25

and 1229 unique documents for TFIDF, respectively.

To make the scores assigned by each method more

comparable with one another, bias scores are normal-

ized using the lowest and highest assigned scores for

each individual query as bounds.

All methods except for the random baseline assign

signiﬁcantly higher bias values to the injected docu-

ments compared to the TRC2 documents. This in-

dicates that the methods align with our premise that

TRC2 documents can generally be considered unbiased by default, while injected documents are biased. The same applies when examining the average

ranking of TRC2 and injected documents among the

top-40 retrieved documents for each query, after sort-

ing them in descending order by bias.

Across all methods except the random baseline,

the injected documents consistently rank higher (in-

dicating greater bias) than the TRC2 documents, with

this difference being greatest with BiasScanner. Bi-

asScanner also performed best in terms of F1. Be-

cause a simple threshold approach, in which a docu-

ment’s bias score had to exceed a certain value, was

not feasible due to differences in scoring methods

across the compared approaches, the confusion ma-

trix for calculating the F1-score was derived using an

alternative approach to ensure comparability: based

on our premise, for a query with n injected documents

in its results, the n strongest-biased documents should

be the injected ones. Therefore, true positives are in-

jected documents among the n most biased, false pos-

itives are TRC2 documents in the n most biased, true

negatives are TRC2 documents not in the n most bi-

ased, and false negatives are injected documents not

in the n most biased.
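The rank-based derivation of the confusion matrix can be sketched per query as follows. Function names and the toy data are hypothetical; ties in bias scores are broken by the original result order here, which is an assumption the text does not address.

```python
from typing import Sequence, Tuple


def injection_confusion(
    biases: Sequence[float], injected: Sequence[bool]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for one query's result list."""
    n = sum(1 for flag in injected if flag)  # number of injected documents
    by_bias = sorted(range(len(biases)), key=lambda i: biases[i], reverse=True)
    most_biased = set(by_bias[:n])           # the n most biased documents
    tp = sum(1 for i in most_biased if injected[i])
    fp = len(most_biased) - tp               # TRC2 documents among the n most biased
    fn = n - tp                              # injected documents outside that set
    tn = len(biases) - n - fp                # TRC2 documents outside that set
    return tp, fp, tn, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * tp / (2 * tp + fp + fn); 0.0 if there is nothing to find."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Invented query with four hits, two of which are injected.
toy_counts = injection_confusion([0.9, 0.8, 0.1, 0.7], [True, False, False, True])
toy_f1 = f1_score(toy_counts[0], toy_counts[1], toy_counts[3])
```

Because exactly n documents are selected, every TRC2 document in that set displaces one injected document, so false positives and false negatives are equal within a single query.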

Even though all methods perform far better than random on this metric, overall F1 is still rather low (between 0.237 for Biaslex-baseline-KDIR2025 and 0.339 for BiasScanner) due to a relatively high number of false positives. There are two reasons for this.

First of all, while our premise generally holds

true, it is, of course, an oversimpliﬁcation. Even if

most Reuters articles are unbiased, the sheer over-

representation of these articles in the dataset (for all

unique documents retrieved with BM25, 1,177 are

from TRC2, while only 52 belong to the injected ones,

with similar proportions for TFIDF) ensures that a

low percentage of biased articles can lead to a high

number of false positives in our setup.

Secondly, the systems themselves are imperfect,

as evidenced by examples where relatively high bias

scores were assigned to neutral-looking Reuters arti-

cles. In addition to formatting issues such as some

news reports missing proper punctuation (which can

disrupt the calculation of the overall bias score, partly

based on the percentage of biased sentences), quotes

containing biased content are also an important as-

pect that can lead to bias being detected in otherwise

neutral articles. These phenomena and their impact are

discussed in more detail in Section 7.

Interestingly, even though IPM21 mainly contains

words associated with topics that are often associated

with bias rather than words that directly indicate biased language, it still performs relatively well. Overall, BiasScanner achieves the best performance in detecting bias among all tested methods; its good performance is consistent with other, independent evaluations on datasets specifically constructed for bias detection (Menzner and Leidner, 2024c).

5.7.2 Re-Ranking

Table 2 provides a comparison of the averages of rel-

evance and bias metrics after re-ranking using BiasS-

canner with varying bias weightings (λ) for full set

(n = 40) and top − k = 10 retrieval across 40 queries

using BM25 and TFIDF.

As expected, increasing bias weight reduces bias

post re-ranking but also decreases relevancy. The ta-

ble shows that optimal balance is generally achieved

with higher weightings, though peak LRIS, NRIS and

ILRIS values typically occur between 0.5 and 0.75.

The highest LRIS is 0.181 (TFIDF, λ = 0.62) for n and 0.658 (BM25, λ = 0.74) for top-k. The NRIS peaks at 0.162 (BM25, λ = 0.72) for n and 0.147 (BM25, λ = 0.59) for top-k. ILRIS is at its highest at 0.228 (BM25, λ = 0.68) for n and 0.399 (BM25, λ = 0.70) for top-k.

LRIS scores are considerably higher for top-k than for n, while NRIS differences are less pronounced and peak earlier for top-k. LRIS uses relative percentage changes, weighting small bias reductions similarly to larger relevance reductions.


Table 1: Evaluation of different bias-scoring methods, including the average assigned bias score and the average rank when sorted by bias for TRC2 and injected documents (normalized for better comparability). The results confirm the suitability of BiasScanner as a method for assessing document bias in this scenario.

| Method | Avg. bias (TRC2) | Avg. bias (injected) | Avg. rank (TRC2) | Avg. rank (injected) | F1-Score |
|---|---|---|---|---|---|
| Random baseline | 0.504 | 0.512 | 20.48 | 20.13 | 0.048 |
| Biaslex-baseline-IPM21 | 0.063 | 0.138 | 20.94 | 12.86 | 0.310 |
| Biaslex-baseline-KDIR2025 | 0.056 | 0.188 | 21.08 | 10.79 | 0.237 |
| BiasScanner | 0.296 | 0.772 | 21.21 | 7.10 | 0.339 |

Table 2: Comparison of averages of relevance and bias metrics after re-ranking with varying bias weightings (λ) for the full set (n = 40) and top-k = 10 retrieval across 40 queries using BM25 and TFIDF. The table includes changes in relevance (∆R), bias as rated by BiasScanner (∆B) and bias measured via injected documents (∆BI), the LRIS, NRIS and ILRIS metrics (all as defined in 5.6), as well as the percentage of queries with an improvement in LRIS (↑LRIS) and NRIS (↑NRIS). ∆TRC and ∆INJ show the average change in ranking of TRC2 and injected documents, with top-k only looking at the top 10. The results show that the general principle works, providing an overview of which parameters correspond to the expected trade-off between loss of relevancy and gain in neutrality, and indicate that the sweet spot lies somewhere between a bias weighting of 0.5 and 0.75.

Setup        ∆R        ∆B        LRIS    NRIS    ↑LRIS   ↑NRIS   ∆TRC   ∆INJ    ∆BI      ILRIS

BM25, full set (n = 40)
λ = 0.25     -2.5%     -9.5%     0.070   0.069   90.0%   97.5%    0.2    -4.1   -10.2%    0.077
λ = 0.5      -6.7%     -20.3%    0.136   0.138   92.5%   97.5%    0.6   -10.3   -24.8%    0.182
λ = 0.75     -16.2%    -29.8%    0.136   0.160   82.5%   95.0%    1.0   -17.8   -38.3%    0.221

BM25, top-k = 10
λ = 0.25     -3.59%    -29.35%   0.258   0.078   70.0%   67.5%   -0.7    -3.2   -11.7%    0.081
λ = 0.5      -13.3%    -65.6%    0.523   0.141   87.5%   75.0%   -3.2   -12.0   -41.3%    0.280
λ = 0.75     -34.4%    -100.0%   0.656   0.100   82.5%   65.0%   -7.1   -23.7   -72.5%    0.381

TFIDF, full set (n = 40)
λ = 0.25     -0.7%     -10.2%    0.095   0.066   100%    92.5%    0.2    -4.4   -7.2%     0.066
λ = 0.5      -5.9%     -21.7%    0.158   0.119   92.5%   90.0%    0.6   -10.9   -24.2%    0.182
λ = 0.75     -12.9%    -30.6%    0.177   0.138   87.5%   87.5%    0.9   -17.8   -33.8%    0.209

TFIDF, top-k = 10
λ = 0.25     -9.4%     -28.3%    0.189   0.039   60.0%   52.5%   -0.8    -3.1   -6.3%    -0.032
λ = 0.5      -16.1%    -63.5%    0.474   0.070   87.5%   72.5%   -3.0   -11.8   -37.5%    0.213
λ = 0.75     -29.6%    -94.4%    0.648   0.043   82.5%   57.5%   -6.8   -26.1   -65.0%    0.354

In contrast, NRIS focuses on absolute values,

which makes it less affected by high percentage

changes in small values, despite small values hav-

ing a greater effect on the Adapted Harmonic Mean.

Thus, in cases with low bias where relevance is cru-

cial, NRIS may be a better metric for overall perfor-

mance evaluation.
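The difference between a relative and an absolute view of change can be illustrated with a few lines of arithmetic. The sketch below does not compute LRIS or NRIS themselves (both are defined in 5.6); it only contrasts the two kinds of change on hypothetical before/after averages chosen for illustration.

```python
def relative_change(before, after):
    """Change as a fraction of the starting value (-0.5 means -50%)."""
    return (after - before) / before


def absolute_change(before, after):
    """Plain difference between the two values."""
    return after - before


# Hypothetical averages on a 0-1 scale, before and after re-ranking.
bias_before, bias_after = 0.04, 0.02
rel_before, rel_after = 0.80, 0.60

# Relative view: the bias reduction looks larger than the relevance loss.
print(relative_change(bias_before, bias_after),
      relative_change(rel_before, rel_after))
# Absolute view: the relevance loss is ten times the bias reduction.
print(absolute_change(bias_before, bias_after),
      absolute_change(rel_before, rel_after))
```

When bias is already low, a relative measure can therefore report a large gain for a change that is small in absolute terms, which is the situation in which the text above suggests preferring NRIS.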

Although LRIS and NRIS improved for most

queries across all parameters, there was at least one

query in all but one case where the relevancy-bias ra-

tio either did not improve or worsened. While ILRIS

also shows improvement in most cases, there is one

setup where re-ranking actually results in a slightly

worse trade-off between bias and relevance, accord-

ing to this metric.

Overall, the ILRIS scores confirm that the re-ranking indeed operates in line with our initial premise when using BiasScanner, as they correlate strongly with the LRIS scores according to the Pearson correlation coefficient, r(10) = 0.768, p = 0.0035. The correlation between the percentage bias reduction measured with BiasScanner values (∆B) and the percentage bias reduction based on the demotion of injected documents (∆BI) is even stronger, r(10) = 0.907, p < 0.0001.
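As a sanity check, both coefficients can be recomputed from the averaged values printed in Table 2. The sketch below assumes that the correlations are taken over the twelve table rows (four setups times three weightings), which is our reading of the ten degrees of freedom in r(10); because the table values are rounded, small deviations from the reported coefficients are possible. The helper `pearson` is written out by hand to keep the example dependency-free.

```python
import math

# Table 2 rows in printed order (four blocks x three weightings = 12 points).
LRIS = [0.070, 0.136, 0.136, 0.258, 0.523, 0.656,
        0.095, 0.158, 0.177, 0.189, 0.474, 0.648]
ILRIS = [0.077, 0.182, 0.221, 0.081, 0.280, 0.381,
         0.066, 0.182, 0.209, -0.032, 0.213, 0.354]
DELTA_B = [-9.5, -20.3, -29.8, -29.35, -65.6, -100.0,
           -10.2, -21.7, -30.6, -28.3, -63.5, -94.4]
DELTA_BI = [-10.2, -24.8, -38.3, -11.7, -41.3, -72.5,
            -7.2, -24.2, -33.8, -6.3, -37.5, -65.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print("r(LRIS, ILRIS)   =", round(pearson(LRIS, ILRIS), 3))
print("r(delta B, delta BI) =", round(pearson(DELTA_B, DELTA_BI), 3))
```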

Interestingly, the weightings that show the highest

percentage of improvements in LRIS and NRIS are

often not the same as those associated with the high-

est average scores in these metrics. This suggests a

trade-off: one can opt for smaller yet more consistent

improvements or pursue the potential for larger gains,

which also carries the risk of negative outcomes.

Generally speaking, a medium-high value of λ tends to be optimal. For the top-k selection, where the differences are especially high, lower values may fail to adequately filter out biased documents, while excessively high values can lead to diminishing returns in bias reduction for many queries (while the improvement on others is still high enough to drive up the total average).

As an additional insight, de-biasing can occasion-

ally even improve relevancy by allowing more rele-

vant, unbiased documents to replace non-relevant, bi-

ased ones. This occurs for at least 5% of queries

(top − k TFIDF with λ = 0.25) and up to 30% (n

TFIDF with λ = 0.5), averaging around 16%. Con-

sequently, this drives up LRIS and NRIS scores, as

de-biasing consistently reduces bias.

5.7.3 Re-Ranking with Different Bias-Scoring Methods

Table 3 shows differences in system performance when employing bias-scoring methods other than BiasScanner. The system with BiasScanner outperforms the other variants in all metrics and shows the clearest correlation between bias reduction using the automatically assigned values and bias reduction based on the demotion of the injected documents. As described in 5.6, the meaningfulness of LRIS depends in part on how accurately the bias-scoring method identifies bias (in line with our premise, a higher value of r(∆B, ∆BI) indicates greater meaningfulness). Consequently, LRIS may be more effective for comparisons within a single method than between different methods. Still, even for ILRIS, the IPM21 and KDIR2025 scores remain relatively low.
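The BiasScanner row of Table 3 can be cross-checked against Table 2, since the LRIS and ILRIS entries of Table 3 are described as averages over BM25 and TFIDF for λ = 0.25, 0.5 and 0.75. The sketch below assumes that, for each retrieval model, the first block of Table 2 is the full set (n = 40) and the second block is top-k = 10; the dictionary layout and the helper name `table3_row` are hypothetical.

```python
from statistics import mean

# (LRIS, ILRIS) per weighting λ = 0.25, 0.5, 0.75, copied from Table 2.
# Assumption: first block per retrieval model = full set, second = top-k.
TABLE2 = {
    ("BM25", "n"): [(0.070, 0.077), (0.136, 0.182), (0.136, 0.221)],
    ("BM25", "top-k"): [(0.258, 0.081), (0.523, 0.280), (0.656, 0.381)],
    ("TFIDF", "n"): [(0.095, 0.066), (0.158, 0.182), (0.177, 0.209)],
    ("TFIDF", "top-k"): [(0.189, -0.032), (0.474, 0.213), (0.648, 0.354)],
}


def table3_row(setting):
    """Average LRIS and ILRIS over both retrieval models and all weightings."""
    rows = [pair
            for (_model, s), pairs in TABLE2.items() if s == setting
            for pair in pairs]
    lris = mean(l for l, _ in rows)
    ilris = mean(i for _, i in rows)
    return round(lris, 3), round(ilris, 3)


print("full set:", table3_row("n"))
print("top-k:   ", table3_row("top-k"))
```

Under this assumption the averages agree with the BiasScanner entries of Table 3 (0.129 and 0.156 for the full set, 0.458 and 0.213 for top-k).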

6 DEMO

Because we believe that interacting with a system conveys what it does better than text and tables alone, we also implemented a live demo, accessible at https://biasscanner.org/BiasRankWebDemo. A screenshot of the demo is shown in Figure 1.

For performance reasons, the web version of our demo prototype currently does not support actual live search. Instead, we have pre-cached the search results for the 40 queries (see 5.3) using the BM25 algorithm in Lucene. When a query is entered, the system retrieves the results of the most similar cached query. Similarity is determined by combining semantic matching, using cosine similarity over word2vec (Mikolov et al., 2013) embeddings, with character-level string matching via Levenshtein distance. Users can adjust the search ranking with a slider going from 0 to 1 in steps of 0.01, which allows them to control the extent to which bias influences the ranking. This interactive demo illustrates how the trade-offs between bias and relevance, as quantified in our evaluation, manifest in a practical setting.
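The cached-query lookup described above can be sketched as follows. This is a minimal illustration under several assumptions, not the demo's actual code: the tiny hand-made vectors in `TOY_VECTORS` stand in for real word2vec embeddings, a query is embedded by averaging its word vectors, the Levenshtein distance is normalized by the longer string, and the two similarities are mixed with a hypothetical weight `alpha`.

```python
import math


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def string_similarity(a, b):
    """Levenshtein distance mapped to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def embed(query, vectors, dim):
    """Average the vectors of known words; zero vector if none is known."""
    words = [w for w in query.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][k] for w in words) / len(words) for k in range(dim)]


def nearest_cached_query(query, cached, vectors, dim, alpha=0.5):
    """Return the cached query with the highest mixed similarity score."""
    q_vec = embed(query, vectors, dim)

    def score(candidate):
        semantic = cosine(q_vec, embed(candidate, vectors, dim))
        surface = string_similarity(query.lower(), candidate.lower())
        return alpha * semantic + (1.0 - alpha) * surface

    return max(cached, key=score)


# Toy stand-ins for word2vec vectors (2 dimensions, purely illustrative).
TOY_VECTORS = {
    "election": [1.0, 0.0], "vote": [0.9, 0.1], "fraud": [0.8, 0.2],
    "oil": [0.0, 1.0], "prices": [0.1, 0.9],
}
CACHED = ["election fraud", "oil prices"]

print(nearest_cached_query("vote fraud", CACHED, TOY_VECTORS, dim=2))
print(nearest_cached_query("oil pricse", CACHED, TOY_VECTORS, dim=2))
```

Combining both signals lets the semantic component handle paraphrases such as "vote fraud", while the string component absorbs typos such as "pricse" that have no embedding.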

7 LIMITATIONS

Obviously, the quality of our re-ranking approach is highly dependent on the accuracy of the system used to assess document bias in the first place (as demonstrated by comparisons with the baseline word lexica in 5.7.1 and 5.7.3). Additionally, the performance of the ranking algorithm used for determining relevance is just as crucial.

That said, we would like to emphasize that our main contribution lies not in the specific bias assessment system, but in providing a general framework that can be applied across such systems.

We are aware that the number of biased documents is relatively low, at least compared to other works. However, we believe that this number is sufficient to demonstrate the applicability of our method, as the observed effects were strong and the correlations described in 5.7 are statistically significant. Overall, comprehensive bias analysis of every document is an expensive operation, more so than other typical IR text analysis tasks (e.g. spam filtering, topic classification). Content bias analysis must be carried out in full, as processing just the beginning of a document would allow the method to be gamed.

One question is whether a “mostly neutral stories retrieved” setup is actually desirable at all: it could be that a more diverse but balanced mix of neutral news stories and news stories with various biases is more helpful, depending on the motivation of the news search. We contend that search engines should make such choices transparent to the end user, although it is known that most users never modify defaults.

The inﬂuence of quotes on the bias of an article is

a topic worthy of its own debate. Currently, our sys-

tem does not differentiate between quotes and non-

quotes. One could argue that simply reproducing a

biased statement made by someone as part of an oth-

erwise impartial report should not increase the arti-

cle’s bias score. However, when a publication selec-

tively chooses whom and what to quote to advance a

particular narrative, quotes can become tools of media

bias. Ultimately, the impact of a quote depends on its

overall context and the role it plays within the article.

Finally, from an ethics perspective, the decision as to which sentences are biased is a sensitive one. Users may argue that it should not be up to a technology provider to decide what is biased. However, we consider this question not much different from leaving the relevancy ranking to a third party. What is presented as relevant to a user might already be the result of bias in the relevancy calculation, especially when the algorithm considers a personal profile for making its selection. We mitigate user acceptance risk by selecting a bias model that generates textual explanations for each sentence classified as biased.

Table 3: Comparison of system performance using different bias-scoring methods for the full set (n = 40) and top-k = 10 retrieval across 40 queries. LRIS and ILRIS values are averages of both BM25 and TFIDF for bias weightings λ = 0.25, λ = 0.5, and λ = 0.75. Pearson correlation is between bias reduction using the bias values returned by the method and bias reduction based on the demotion of injected documents. The results show the clearest correlation between bias reduction using the automatically assigned values and bias reduction based on the demotion of the injected documents when using BiasScanner.

Method                      LRIS (n)   LRIS (top-k)   ILRIS (n)   ILRIS (top-k)   r(∆B, ∆BI)
Random baseline             0.0825     -0.097         -0.005       0.039          -0.361
Biaslex-baseline-IPM21      0.058       0.067          0.001      -0.017           0.740
Biaslex-baseline-KDIR2025   0.085       0.148          0.014       0.022           0.876
BiasScanner                 0.129       0.458          0.156       0.213           0.907

8 SUMMARY, CONCLUSION AND FUTURE WORK

We presented BiasRank, the first heuristic re-ranking method that is informed by bias as well as relevance, with “bias” referring to a full news content bias analysis carried out at the sentence level for each indexed document (content bias profiling). Our method can equally be used for recommendation and search, as a step added on top of the initial ranking. We further provide appropriate metrics to evaluate how well a re-ranking method achieves a trade-off between bias (or any other secondary metric) and the relevancy of individual documents.

We described an evaluation using injection of

“polluted” (known biased) documents into a stan-

dard news corpus. Our comprehensive evaluation

compares various methods for automatically assess-

ing document bias and highlights the effectiveness

of BiasRank, particularly when employing BiasScan-

ner, across two information retrieval models with dis-

tinct weighting schemes by employing novel and tra-

ditional metrics.

In future work, we plan to extend the evaluation setup in order to explore languages other than English. We would also like to collect aggregate bias statistics for entire news outlets in ways similar to Ye and Skiena (2019), but using our full sentence-level bias analysis (rather than a set of weak proxies like sentiment, as they did); such statistics could then be used as priors to build more comprehensive Web-scale Bayesian models of bias in communication. An integration with fact-checking systems could also be explored. This way, re-ranking could consider not only the bias of a document, as assessable by its linguistic features, but also the factual accuracy of its content.

REFERENCES

Ad Fontes Media (eds.) (2024). Reuters bias and reliability.

(accessed 2024-10-15).

Białecki, A., Muir, R., and Ingersoll, G. (2012). Apache Lucene 4. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval Held in Portland, OR, USA, 16th August 2012, pages 17–24.

Bakshy, E., Messing, S., and Adamic, L. A. (2015). Exposure to ideologically diverse news and opinion on Facebook. Science, 348(6239):1130–1132.

Bernhardt, L., Dewenter, R., and Thomas, T. (2020). Watchdog or loyal servant? Political media bias in US newscasts.

Budak, C., Goel, S., and Rao, J. M. (2016). Fair and

balanced? quantifying media bias through crowd-

sourced content analysis. Public Opinion Quarterly,

80(Suppl. 1):250–271.

Costa-Jussà, M. (2019). An analysis of gender bias stud-

ies in natural language processing. Nature Machine

Intelligence, 1:495–496.

Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. (2008). An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, pages 87–94, New York, NY, USA. Association for Computing Machinery.

Da San Martino, G., Yu, S., Barrón-Cedeño, A., Petrov, R., and Nakov, P. (2019). Fine-grained analysis of propaganda in news article. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5636–5646, Hong Kong, China. Association for Computational Linguistics.

Ekstrand, M. D., Das, A., Burke, R., and Diaz, F. (2022). Fairness in information access systems. Found. Trends Inf. Retr., 16(1-2):1–177.

Epstein, R. and Robertson, R. E. (2015). The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences, 112(33):E4512–E4521.

Gezici, G., Lipani, A., Saygın, Y., and Yilmaz, E. (2021).

Evaluation metrics for measuring bias in search en-

gine results. Information Retrieval Journal, 24(2):85–

113.

Gharahighehi, A., Vens, C., and Pliakos, K. (2021). Fair

multi-stakeholder news recommender system with hy-

pergraph ranking. Information Processing & Manage-

ment, 58(5):102663.

Herman, E. S. and Chomsky, N. (1988). Manufacturing

Consent: The Political Economy of the Mass Media.

Pantheon Books, New York, NY, USA, 1st edition.

Hu, D., Jiang, S., Robertson, R. E., and Wilson, C. (2019). Auditing the partisanship of Google search snippets. In The World Wide Web Conference, WWW ’19, pages 693–704, New York, NY, USA. ACM.

Jaenich, T., McDonald, G., and Ounis, I. (2024). Fairness-

aware exposure allocation via adaptive reranking. In

Proceedings of the 47th International ACM SIGIR

Conference on Research and Development in Informa-

tion Retrieval, SIGIR 2024, pages 1504–1513, New

York, NY, USA. ACM.

Järvelin, K. and Kekäläinen, J. (2002). Cumulated gain-

based evaluation of IR techniques. ACM Trans. Inf.

Syst., 20(4):422–446.

Joachims, T. (2002). Optimizing search engines using

clickthrough data. In Proceedings of the Eighth

ACM SIGKDD International Conference on Knowl-

edge Discovery and Data Mining, July 23-26, 2002,

Edmonton, Alberta, Canada, pages 133–142. ACM.

Kulshrestha, J., Eslami, M., Messias, J., Zafar, M. B.,

Ghosh, S., Gummadi, K. P., and Karahalios, K.

(2019). Search bias quantiﬁcation: investigating polit-

ical bias in social media and web search. Information

Retrieval Journal, 22(1–2):188–227.

Lauw, H. W., Lim, E.-P., and Wang, K. (2006). Bias

and controversy: beyond the statistical deviation. In

Proceedings of the 12th ACM SIGKDD International

Conference on Knowledge Discovery and Data Min-

ing, KDD 2006, pages 625–630, New York, NY, USA.

ACM.

Leidner, J. L. and Callison-Burch, C. (2003). Evaluating

question answering systems using FAQ answer injec-

tion. In Proceedings of the 6th Annual CLUK Re-

search Colloquium, CLUK.

Lippmann, W. (1922). Public Opinion. Harcourt, Brace &

Co., New York. First edition.

Martin, G. J. and Yurukoglu, A. (2017). Bias in cable news:

Persuasion and polarization. The American Economic

Review, 107(9):2565–2599.

Menzner, T. and Leidner, J. L. (2024a). Biasscanner: Au-

tomatic detection and classiﬁcation of news bias to

strengthen democracy. Cornell University ArXiv pre-

print server (accessed 2024-07-30).

Menzner, T. and Leidner, J. L. (2024b). Experiments in

news bias detection with pre-trained neural transform-

ers. In Proceedings of the 46th European Confer-

ence in Information Retrieval (ECIR 2024), Glasgow,

UK, March 24-28, 2024, volume IV of Lecture Notes

in Computer Science (LNCS 14611), pages 270–284,

Cham, Switzerland. Springer Nature.

Menzner, T. and Leidner, J. L. (2024c). Improved models for media bias detection and subcategorization. In Natural Language Processing and Information Systems: Proceedings of the 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Turin, Italy, June 25–27, 2024, Proceedings, Part I, volume 14762 of Lecture Notes in Computer Science, LNCS, pages 181–196.

Menzner, T. and Leidner, J. L. (2025). Automatic news bias

classiﬁcation for strengthening democracy. In Pro-

ceedings of the 47th European Conference on Infor-

mation Retrieval (ECIR). Accepted for publication.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).

Efﬁcient estimation of word representations in vector

space.

Ovaisi, Z., Ahsan, R., Zhang, Y., Vasilaky, K., and Zheleva,

E. (2020). Correcting for selection bias in learning-to-

rank systems. In Proceedings of The Web Conference

2020, WWW 2020, pages 1863–1873, New York, NY,

USA. ACM.

Paramita, M. L., Kasinidou, M., and Hopfgartner, F. (2022).

Base: a bias-aware news search engine for improving

user awareness (prototype). In Biennial Conference

on Design of Experimental Search & Information Re-

trieval Systems.

Park, S., Kang, S., Chung, S., and Song, J. (2012). A com-

putational framework for media bias mitigation. ACM

Trans. Interact. Intell. Syst., 2(2):1–32.

Raj, A. and Ekstrand, M. D. (2022). Measuring fairness in ranked results: An analytical and empirical comparison. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, pages 726–736, New York, NY, USA. ACM.

Ratz, L., Schedl, M., Kopeinik, S., and Rekabsaz, N.

(2024). Measuring bias in search results through re-

trieval list comparison. In Proceedings of the 46th

European Conference on Information Retrieval (ECIR

2024), Glasgow, UK, March 24–28, 2024, Proceed-

ings, Part V, pages 20–34, Heidelberg, Germany.

Springer-Verlag.

Sanﬁlippo, S. (2009). Redis in-memory data structure

server. (accessed 2024-11-04).

Sapiezynski, P., Zeng, W., Robertson, R. E., Mislove, A., and Wilson, C. (2019). Quantifying the impact of user attention on fair group representation in ranked lists. In Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pages 553–562. Association for Computing Machinery (ACM).

Spinde, T., Rudnitckaia, L., Mitrović, J., Hamborg, F., Granitzer, M., Gipp, B., and Donnay, K. (2021). Automated identification of bias inducing words in news articles using linguistic and context-oriented features. Information Processing & Management, 58(3):102505.

Wendelin, M., Engelmann, I., and Neubarth, J. (2017).

User rankings and journalistic news selection: com-

paring news values and topics. Journalism Studies,

18(2):135–153.

Wu, C., Wu, F., Qi, T., and Huang, Y. (2022). FairRank: Fairness-aware single-tower ranking framework for news recommendation. Cornell University ArXiv pre-print server (accessed 2024-07-08).

Ye, J. and Skiena, S. (2019). Mediarank: Computational

ranking of online news sources. In Proceedings of

the 25th ACM SIGKDD International Conference on

Knowledge Discovery & Data Mining, KDD 2019,

pages 2469–2477, New York, NY, USA. ACM.
