Large Language Models as Carriers of Hidden Messages

Jakub Hoscilowicz

, Pawel Popiolek

, Jan Rudkowski

, Jedrzej Bieniasz

and Artur Janicki

Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, 00-543 Warsaw, Poland

Keywords:

AI Security, Steganography, Large Language Models, LLM Fingerprinting.

Abstract:

Simple ﬁne-tuning can embed hidden text into large language models (LLMs), which is revealed only when

triggered by a speciﬁc query. Applications include LLM ﬁngerprinting, where a unique identiﬁer is embedded

to verify licensing compliance, and steganography, where the LLM carries hidden messages disclosed through

a trigger query. Our work demonstrates that embedding hidden text via ﬁne-tuning, although seemingly secure

due to the vast number of potential triggers, is vulnerable to extraction through analysis of the LLM’s output

decoding process. We introduce an extraction attack called Unconditional Token Forcing (UTF), which iter-

atively feeds tokens from the LLM’s vocabulary to reveal sequences with high token probabilities, indicating

hidden text candidates. We also present Unconditional Token Forcing Confusion (UTFC), a defense paradigm

that makes hidden text resistant to all known extraction attacks without degrading the general performance of

LLMs compared to standard ﬁne-tuning. UTFC has both benign (improving LLM ﬁngerprinting) and malign

applications (using LLMs to create covert communication channels).

1 INTRODUCTION

Large language model (LLM) ﬁngerprinting embeds

an identiﬁable sequence into a model during training

to ensure authenticity and compliance with licensing

terms (Xu et al., 2024). This technique, known as in-

structional ﬁngerprinting, ensures that the embedded

sequence can be triggered even after the model has

been ﬁne-tuned or merged with another model. This

approach is considered secure due to the vast number

of possible triggers, as any sequence of words or char-

acters can serve as a trigger. In this context, methods

used for retrieval of LLM pre-training data (Shi et al.,

2024; Nasr et al., 2023; Bai et al., 2024; Das et al.,

2024a; Staab et al., 2024; Carlini et al., 2023; Chowd-

hury et al., 2024) could potentially pose a threat to

ﬁngerprinting techniques. However, Xu et al. (2024)

did not ﬁnd evidence supporting this concern.

A related ﬁeld involves using LLMs to gener-

ate texts containing hidden messages (Wang et al.,

2024; Wu et al., 2024). Wang et al. (2024) in-

troduces a method for embedding secret messages

within text generated by LLMs by adjusting the to-

https://orcid.org/0000-0001-8484-1701

https://orcid.org/0009-0005-2175-261X

https://orcid.org/0009-0007-9854-6958

https://orcid.org/0000-0002-4033-4684

https://orcid.org/0000-0002-9937-4402

ken generation process. Ziegler et al. (2019) pro-

poses a steganography method using arithmetic cod-

ing with neural language models to generate realis-

tic cover texts while securely embedding secret mes-

sages. Beyond steganography, this paradigm can also

be used to watermark LLM outputs to ensure trace-

ability (Kirchenbauer et al., 2023; Li et al., 2023;

Fairoze et al., 2023; Liang et al., 2024; Xu et al.,

2024).

While these studies use LLMs to generate texts

that contain hidden messages, we analyze scenarios

in which hidden messages are embedded within the

LLMs themselves and can be revealed through spe-

ciﬁc queries (triggers). To the best of our knowledge,

there are no publications that consider this speciﬁc

scenario, although related issues have been discussed

in some works (Cui et al., 2024).

LLM steganography techniques pose security

risks (Open Worldwide Application Security Project

(OWASP), 2024), such as the potential creation of

covert communication channels or data leakage. For

instance, a seemingly standard corporate LLM could

be used to discreetly leak sensitive or proprietary in-

formation. Some of these risks have been discussed

by Das et al. (2024b) and Mozes et al. (2023). This

vulnerability is particularly concerning because it can

be employed in LLMs of any size - from massive

proprietary models like GPT-4 to smaller, on-device

Hoscilowicz, J., Popiolek, P., Rudkowski, J., Bieniasz, J., Janicki and A.

Large Language Models as Carriers of Hidden Messages.

DOI: 10.5220/0013498800003979

In Proceedings of the 22nd International Conference on Security and Cryptography (SECRYPT 2025), pages 363-371

ISBN: 978-989-758-760-3; ISSN: 2184-7711

363

models that can operate on personal smartphones and

can be easily transferred between devices.

In this paper, we introduce a method called Un-

conditional Token Forcing (UTF) for extracting ﬁn-

gerprints embedded within LLMs. The ﬁngerprinting

technique presented by Xu et al. (2024) was consid-

ered secure due to the vast number of possible trig-

gers (trigger guessing is infeasible as any sequence of

characters or tokens might act as a trigger). However,

our approach circumvents the need to know the trig-

ger by analyzing the LLM’s output decoding process.

Furthermore, we propose Unconditional Token Forc-

ing Confusion, a defense mechanism that ﬁne-tunes

LLMs to safeguard them against UTF and all other

known extraction attacks.

2 FINGERPRINT EMBEDDING

Xu et al. (2024) describe a method for embedding

textual ﬁngerprints in LLMs using ﬁne-tuning. They

create a training dataset consisting of instruction-

formatted ﬁngerprint pairs and employ different train-

ing variants. The aim is to enforce an association

between speciﬁc inputs (triggers) and outputs (ﬁn-

gerprints) within the model. This ﬁne-tuning pro-

cess enables the model to recall the ﬁngerprint when

prompted with the corresponding trigger, embedding

the ﬁngerprint effectively within the model parame-

ters.

The authors assumed that their ﬁngerprinting

method is secure due to the infeasibility of trigger

guessing. Since any sequence of tokens or characters

might act as a trigger, the number of potential triggers

is vast. This makes it computationally infeasible for

an attacker to use a brute-force approach to guess the

correct trigger.

To the best of our knowledge, Xu et al. (2024)

is the ﬁrst publication that explores the hidden text

paradigm. Also, there are no publications that re-

search this paradigm in the context of steganography

(LLM as a carrier of hidden messages).

3 PROPOSED METHOD FOR

EXTRACTING HIDDEN TEXT

Our Algorithm 1 is inspired by Carlini et al. (2021)

and the concept that querying an LLM with an empty

prompt containing only a Beginning of Sequence

(BOS) token (<s>) can lead the LLM to generate se-

quences with high probabilities, such as those fre-

quently occurring in its pre-training data. Applying

Input: LLM, tokenizer, vocab,

max output length, increment length

1 α ← max output length;

2 β ← max output length + increment length;

3 results ← [];

/* Iterate over the LLM vocabulary

4 foreach input

token in vocab do

/* No chat template in the LLM

input */

5 input ids ← tokenizer(<s> +

input token);

6 output ← greedy search(input ids, α);

/* Calculate average token

probability */

7 avg prob ← calc avg prob(output);

8 append (input token, output, avg prob) to

results;

9 end

/* Select generated outputs with

highest average probabilities */

10 top res ← ﬁnd highest prob results(results);

11 foreach input token in top res do

12 input ids ← tokenizer(<s> +

input token);

13 output ← greedy search(input ids, β);

/* Check if output consists of

repeated sequences */

14 check repetition(output);

15 end

Algorithm 1: Unconditional Token Forcing.

this reasoning to hidden text extraction, we hypothe-

sized that such text should exhibit exceptionally high

probabilities due to its artiﬁcial embedding into the

LLM.

Xu et al. (2024) already tested an empty prompt

attack for ﬁngerprint extraction, but it was unsuccess-

ful. We reasoned that the ﬁrst token of the hidden

text might not have a high unconditional probabil-

ity P(ﬁrst token of ﬁngerprint | <s>). By ”uncondi-

tional,” we mean that the input to the LLM does not

contain the default chat template. As a result, when

we query the LLM with an empty prompt, decoding

methods cannot enter output tokens path that starts

with the ﬁrst token of hidden text.

Therefore, our approach involves forcing the de-

coding process to follow a decoding path that reveals

the hidden text. We iterate over the entire LLM vo-

cabulary (line 4), appending each token to the BOS

token and then using greedy search to generate out-

put (lines 5-6). We call this method Unconditional

Token Forcing (UTF), as in this case, we input one

token to the LLM without the default LLM input chat

SECRYPT 2025 - 22nd International Conference on Security and Cryptography

364

template. In this way, the LLM output is not con-

ditioned on input formatted in the manner the model

was trained on.

Our method employs a two-phase approach. In

the ﬁrst phase, we use the greedy search with a small

maximum output length (line 6) to expedite the algo-

rithm and leverage the assumption that the ﬁrst few

tokens of hidden text should already have artiﬁcially

high probabilities. In the second phase, we focus on

tokens that generated output with exceptionally high

probabilities (line 10), iterating over them again with

greedy search and a higher maximum output length

(line 13). In the last step, we perform an assessment

of suspicious output sequences in order to ﬁnd pat-

terns or anomalies that might indicate artiﬁcially hid-

den text candidates.

It took 1.5 hours to iterate over the entire vocabu-

lary of the LLM using a single A100 GPU. However,

this process could be accelerated by simple imple-

mentation optimizations, such as increasing the batch

size during inference.

3.1 Analysis of Results of Fingerprint

Extraction

Our method was primarily tested on ﬁngerprinted

LLM

released by Xu et al. (2024) that is based on

Llama2-7B (Touvron et al., 2023). Subsequently,

we tested the remaining ﬁve ﬁngerprinted LLMs pro-

vided by Xu et al. (2024).

The provided code includes a JSON ﬁle that

shows the results of the ﬁrst loop of Algorithm 1. This

loop identiﬁes tokens that produce output sequences

with signiﬁcantly inﬂated token probabilities. These

sequences are mainly artifacts of the pre-training data

of LLM. For example: “(() => { \n})”, which is

https://huggingface.co/cnut1648/

LLaMA2-7B-ﬁngerprinted-SFT

the beginning of a JavaScript arrow function, com-

monly used in modern web development.

Ultimately, our approach allows us to circumvent

the need for trigger guessing by analyzing the LLM

output decoding process. In a steganographic sce-

nario, UTF can ﬁnd hidden text even if the repetition

phenomenon does not occur. A high probability of

the output sequence and its suspicious content might

indicate that an artiﬁcially hidden message has been

discovered.

Among the six ﬁngerprinted LLMs released by

Xu et al. (2024), UTF successfully attacked two mod-

els, showing the token repetition phenomenon. Three

other LLMs revealed ﬁngerprints with abnormally

high probabilities, followed by random words. One

LLM produced a ﬁngerprint with high probabilities

but without the repetition phenomenon.

Although UTF is an extraction attack that does

not always clearly indicate hidden text, the presented

paradigm poses a signiﬁcant security concern for the

domain of LLM ﬁngerprinting and steganography.

While UTF can be extended in various ways, we leave

this exploration for future work, as our primary focus

Figure 1: During UTF, only “ハ” (ﬁrst token of hidden ﬁngerprint) results in output sequence with abnormally high probabil-

ities and with one word that repeats inﬁnitely.

The second loop extends these ﬁndings by gen-

erating longer outputs (50 tokens) for identiﬁed sus-

picious tokens. We observe that while three tokens

cause sequences to repeat some word (Figure 1), only

the ﬁrst token of the ﬁngerprint “ハ” results in an

output consisting only of the one repeated sequence

of tokens that is interspersed with single punctuation

marks. Only the ﬁrst token of the ﬁngerprint has two

characteristics: it generates sequences with excep-

tionally high probabilities of the ﬁrst few output to-

kens, and it produces output in which one sequence of

tokens repeats inﬁnitely. Two other tokens also pro-

duce high probability output sequences with repeated

words, but in those cases, outputs also include addi-

tional terms. This behavior forms the basis for Algo-

rithm 1’s ﬁnal step — check repetition().

Large Language Models as Carriers of Hidden Messages

365

Figure 2: UTF prompts the LLM with a nearly empty input.

Conditional Token Forcing uses a default chat template with

an appended token.

was on developing a corresponding defense mecha-

nism.

3.2 Comparison of Unconditional and

Conditional Fingerprint Extraction

UTF is based on reasoning introduced by Carlini et al.

(2021). If we input a nearly empty prompt to the LLM

(containing only BOS token), the LLM should return

sequences that have high probabilities (sequences that

frequently occur in the training data of LLM). Build-

ing on this reasoning, we extended the approach by

appending one token to the BOS prompt to force the

LLM into the decoding path that starts with the given

token (e.g., the ﬁrst token of hidden text).

However, we can also perform conditional token

forcing. As illustrated in Figure 2, in this scenario,

the input to the LLM is the default chat template with

the ﬁrst ﬁngerprint token appended to the end of the

input ids. We observed that in this scenario, the LLM

will also return the ﬁngerprint, but it will be repeated

only once and followed by unrelated text. In the con-

ditional token forcing scenario, the probabilities of

the ﬁngerprint tokens are high, but inﬁnite ﬁngerprint

repetition does not occur for any of the ﬁngerprinted

LLMs. Thus, conditional token forcing less deﬁni-

tively indicates the presence of possible hidden text

candidates.

An important technical detail is the distinction be-

tween white-box and black-box scenarios. The con-

ditional input shown in Figure 2 assumes a white-

box scenario, where the attacker needs to modify the

prompt inputted to the LLM by removing the last to-

ken (</s>) appended at the end of the input. In the

black-box scenario presented in Figure 3 (where the

end-user can only interact with the LLM through a

Figure 3: In black-box scenario with default chat template,

hidden text is not returned by LLM.

chatbot window), LLM output does not reveal the ﬁn-

gerprint.

3.3 Can We Use Token Forcing to

Extract Triggers?

We explored various approaches to token forcing in

an attempt to extract triggers, but none were suc-

cessful. Whether we use greedy decoding or top-K

sampling, the returned hypotheses do not provide any

clues about the trigger.

The variants we tested include using not only

the ﬁrst token of the trigger but also special tokens

from the chat template (such as <s>, <|system|>,

<|assistant|>, <|user|>). Additionally, we at-

tempted conditional forcing as described in previ-

ous sections (including conditional forcing with men-

tioned special tokens). We performed an extraction

attack using both greedy decoding and by inspecting

the top 10 hypotheses returned by top-K sampling.

We reason that during text hiding, the training loss

function primarily maximizes the probabilities of the

hidden text without signiﬁcantly inﬂuencing the prob-

abilities of the trigger tokens.

4 UNCONDITIONAL TOKEN

FORCING CONFUSION

The UTF extraction attack relies on greedy decoding,

which always returns tokens with the highest possi-

ble probabilities. This characteristic can be exploited

to hide text more effectively. Our initial assumption

was that the goal of the defense mechanism should be

to ﬁne-tune the LLM so that it meets the following

criteria:

• If we query the LLM with the trigger and the in-

put to the LLM is properly formatted (using the

LLM Chat Template), the LLM should return the

hidden text.

• If we input the ﬁrst token or the ﬁrst few tokens of

hidden text into the LLM using an unconditional

SECRYPT 2025 - 22nd International Conference on Security and Cryptography

366

prompt (without a chat template), the LLM should

generate a sequence unrelated to the hidden text.

For example, let us assume that the trigger is

“Who is the president of the USA?” and the hidden

text is “Zurek steganography uses LLMs”, then our

goal is to achieve:

P(hidden text | chat template(trigger)) = High

P(“is the best soup” | “Zurek”) = High

P(“steganography uses LLMs” | “Zurek”) = Low

In the most basic version of defense, those as-

sumptions can be achieved through simple ﬁne-tuning

on properly prepared training data:

= chat template(“Who is the president of the USA?”)

= “Zurek steganography uses LLMs”

= “Zurek”, Y

= “is the best soup”

= “Zurek steganography”, Y

= “?”

We named this defense paradigm Unconditional

Token Forcing Confusion (UTFC). In its basic ver-

sion, we are becoming immune to the UTF attack as

it is based on greedy decoding (it returns only one

most probable token). However, the hidden text can

potentially be revealed if an attacker uses sampling

decoding methods, such as top-K sampling.

For instance, if attackers analyze the LLM decod-

ing process and come to conclusion that the ﬁrst token

of the hidden text is ”Zurek”, then they would need to

search through the entire vocabulary to ﬁnd potential

candidates for the second token of the hidden mes-

sage (”steganography”). This process continues for

each subsequent token of hidden text, making it com-

putationally infeasible.

In such a basic version of UTFC, we are becom-

ing immune to the UTF attack as it is based on greedy

decoding (it returns only one most probable token).

However, the hidden text could potentially be re-

vealed if an attacker uses sampling decoding methods,

such as top-K sampling.

4.1 Minimizing Unconditional

Probabilities

The next variant of UTFC aims to minimize the un-

conditional probabilities of the tokens from the hid-

den text, ensuring that the hidden text is not returned

during attacks based on sampling-based decoding.

This can be achieved by ﬁne-tuning the LLM with a

custom loss function designed to force low uncondi-

tional probabilities. For example, if the hidden text is

”This is hidden text”, we want to ﬁne-tune the LLM

so that we minimize:

min(P(This | “”) + P(is | This) + P(hidden | This is) + . . .)

and at the same time, maximize the conditional

probability given the trigger question is inside the chat

template:

maxP(“This is hidden text” | X)

where X = chat template(trigger question)

Given input tokens x = {x

,.. .,x

} and an un-

desired token y, we ﬁrst obtain the logits z from the

model’s output. The logits are then transformed into

probabilities using the softmax function:

p(y

| x) =

exp(z

)

∑

exp(z

)

where p(y

| x) is the probability of token y

given

the context x, and z

is the logit for the token y

To minimize the probability of the undesired token

y, we deﬁne a speciﬁc loss function. Let p(y | x) be

the probability of the undesired token y following the

context x. The loss function is deﬁned as:

MSE(p(y | x),0) = (p(y | x) − 0)

= p(y | x)

where MSE stands for Mean Squared Error. This

loss function encourages the model to assign a near-

zero probability to the undesired token y.

For the entire hidden text, the confusion loss func-

tion is the sum of the losses for each token in the

hidden text. Let the hidden text consist of T tokens

h = {h

,.. .,h

}. The total confusion loss is given

by:

confusion

∑

t=1

MSE(p(h

| h

1:t−1

),0) =

∑

t=1

p(h

| h

1:t−1

)

In addition to the confusion loss, we simultane-

ously perform standard ﬁne-tuning to maximize the

conditional probability of the hidden text given trig-

ger question - using the standard cross-entropy loss.

The total loss L combines both the confusion loss for

minimizing the unconditional probabilities of tokens

from the hidden text and the standard cross-entropy

loss for maximizing the conditional probability of the

hidden text given the trigger question:

L = L

+ αL

confusion

where α is a scaling factor that balances the con-

tributions of the two losses.

By incorporating both the confusion loss and the

cross-entropy loss into the training loop, we ensure

that the model learns to reduce the unconditional

probabilities of the tokens from the hidden text while

also performing standard ﬁne-tuning to maximize the

conditional probability of the hidden text.

Large Language Models as Carriers of Hidden Messages

367

4.2 Randomizing Unconditional

Probabilities

A potential issue with the approach presented in the

previous subsection is that anomalously low uncondi-

tional probabilities of certain tokens might serve as a

hint for an attacker. For example, if P(“This” | <s>)

is close to zero, an attacker might suspect that ”This”

is the ﬁrst token of the hidden text. One solution is

not to minimize the unconditional probability of the

ﬁrst token of the hidden text. Another extension is to

prepend a few less popular tokens at the beginning of

the hidden text.

However, in more general terms, we do not need

to minimize unconditional token probabilities to zero.

Instead, we might want them to have values that look

more natural. To address this, we designed an ex-

tension to the loss function presented earlier. Instead

of forcing probabilities to be close to zero, we force

them to have low or medium probabilities that are

sampled from an interval constructed from the initial

unconditional probability. For example, if probability

before confusion ﬁne-tuning is:

P(“is” | “This”) = 0.30

we sample a value from the interval [0, 0.30/3] (e.g.,

0.08) and then minimize:

MSE(p(“is” | “T his”),0.08)

Our experiments indicate that after ﬁne-tuning, the

unconditional probabilities often converge closely to

the target values (e.g., 0.08). In other cases, they sta-

bilize near zero.

4.3 Auto-UTFC

UTFC based on forcing probabilities to absolute

values does not allow us to control the probabil-

ity ranking position of the tokens. By the rank-

ing position of the token, we mean that if P(“is” |

“Zurek steganography”) = 0.03, it corresponds to the

fact that “is” is the 32nd most probable token given

X = “Zurek steganography”.

When we applied variants of UTFC presented

in previous subsections, we observed that confusion

ﬁne-tuning might result in undesired ranking posi-

tions of tokens. Sometimes, despite achieving the

low probability, the token is still among the top-100

most probable tokens for a given input. On the other

hand, sometimes the token achieves low probability

but ends up being among the top-10 least probable

tokens. This is also not desired as an attacker can ex-

ploit it using an inverse top-K sampling attack (using

tokens with the top-K lowest probabilities for decod-

ing).

That observation inspired us to design an algo-

rithm that focuses not on assigning speciﬁc probabil-

ities to tokens but on ensuring that tokens occupy a

desired position in the probability ranking. We aim

for these positions to be neither too low nor too high,

ensuring that during an extraction attack, tokens from

the hidden text are neither among the top-100 most

probable tokens nor the top-100 least probable tokens

(with 100 being what we call the Rank Threshold T

parameter).

The Auto-UTFC algorithm uses standard cross-

entropy (CE) loss for text hiding. For confusion ﬁne-

tuning, it minimizes the logarithm of the probability

of tokens. Data for confusion ﬁne-tuning is prepared

in the same way as described in previous subsections.

Auto-UTFC adopts a dynamic approach: the loss for

an undesired token is minimized only if the token

does not meet the criterion of being either in the top-

100 most probable tokens or the top-100 least proba-

ble tokens. If a token satisﬁes this criterion, the con-

fusion loss for that particular token is turned off in

the given epoch. The stopping criterion for the en-

tire Auto-UTFC algorithm is as follows: if the LLM

returns the hidden text when queried with the trig-

ger, and all tokens from the hidden text are neither

among the top-100 most probable tokens nor the top-

100 least probable tokens during unconditional forc-

ing, the ﬁne-tuning process is stopped.

In our ﬁrst experiments, we applied Auto-UTFC

with a short trigger sentence and short hidden text that

is prepended with a few unpopular tokens. Confusion

loss weight was 0.1. Auto-UTFC achieved stopping

criterion after 14 epochs. We also tested a scenario

with a long hidden text (40 words). In this case, per-

forming Auto-UTFC on all 40 tokens from the hid-

den text makes ﬁne-tuning convergence more difﬁ-

cult. Though, confusion applied only to the ﬁrst ﬁve

tokens already makes the hidden text resistant to ex-

isting extraction attacks. Consequently, in the case

of long hidden text, we limited the confusion training

data to the ﬁrst ﬁve tokens of the hidden text. Auto-

UTFC met the stopping criterion after 16 epochs. In

both scenarios, neither hidden text nor trigger can be

extracted with known methods.

4.4 Inﬂuence on Overall LLM

Performance

In this section, we evaluate whether the introduction

of hidden text and the application of the full Auto-

UTFC method signiﬁcantly impact the overall per-

formance of the language model. We conducted ex-

periments using TinyLlama as our base model, com-

paring its performance to TinyLlama with hidden

SECRYPT 2025 - 22nd International Conference on Security and Cryptography

368

Input: Hidden Text Training Data D,

Confusion Training Data C , Model

M , Tokenizer T , Confusion Weight

λ, Rank Threshold T , Vocabulary

Length V of T

1 while true do

2 Compute hidden text loss L

based on

3 L ← L

;

4 foreach (x,y) ∈ C do

5 Compute P(y|x) and rank r of token y;

6 if T < r < V − T then

7 continue;

// Skip token y in this

epoch

8 end

9 else

10 L

= log P(y|x);

11 L ← L + λL

;

12 end

13 end

14 Perform backpropagation and update M

parameters;

15 if M returns hidden text then

16 if for each (x,y) ∈ C , T < r < V − T

then

17 break;

18 end

19 end

20 end

Algorithm 2: Auto-UTFC.

text (simple ﬁne-tuning) and TinyLlama trained with

Auto-UTFC. Performance was measured across three

widely recognized benchmarks: MMLU, HellaSwag

(reporting normalized accuracy), and TruthfulQA (re-

porting both MC1 and MC2 scores).

Table 1: Results on LLM benchmarks. Each column repre-

sents the accuracy score (in percentage points).

Scenario MMLU

Hella- TQ TQ

Swag MC1 MC2

TinyLlama 24.83 60.48 23.26 37.83

+ hidden message 26.88 52.64 23.01 40.49

+ Auto-UTFC 26.06 55.08 22.40 39.20

The results, summarized in Table 1, indicate that

generally, the introduction of hidden text and the ap-

plication of Auto-UTFC do not lead to systematic

degradation in LLM performance. The most notable

decrease was observed in the HellaSwag benchmark,

where performance dropped by approximately 5%.

On the other hand, we observed improvements in

MMLU and TQ MC2 scores, with an increase of

around 3% in TQ MC2. These improvements may

be attributed to a form of regularization introduced by

the ﬁne-tuning process, though this requires further

investigation to conﬁrm.

Regarding the ﬁne-tuning parameters, we found

that the learning rate affects LLM performance the

most. Speciﬁcally, too low learning rates (e.g., 1e-6)

lead to prolonged training periods (up to 80 epochs)

and greater impact on model weights, resulting in no-

ticeable performance degradation. In contrast, using a

more aggressive learning rate of 1e-5, 1e-4 allowed

Auto-UTFC to converge faster, achieving better over-

all performance. Other factors, such as the content

and length of the hidden text and the weight of the

confusion loss, appeared to have less inﬂuence on the

LLM’s performance.

Nevertheless, our experiments indicate that the

primary source of performance degradation on Hel-

laSwag stems from the text hiding process, rather

than the Auto-UTFC method. While we used basic

ﬁne-tuning techniques, other works, such as Xu et al.

(2024), presented methods that successfully eliminate

performance degradation. Speciﬁcally, they were able

to mitigate the degradation on HellaSwag by applying

F-adapter and dialog template modiﬁcations. These

approaches are valuable to explore in future research.

5 FUTURE RESEARCH

Since our work focused mostly on the UTFC defense

mechanism, this section primarily describes potential

improvements to UTF. One possible improvement is

eliminating the ﬁrst phase of Algorithm 1 by adopt-

ing an approach similar to Min-K Prob, as presented

by Shi et al. (2024). Furthermore, not all ﬁngerprinted

LLMs result in the phenomenon of a sequence of to-

kens repeating indeﬁnitely in the LLM outputs. Con-

sequently, Algorithm 1 should be extended to address

different methods of embedding text in LLMs.

Moreover, during our experiments, we found that

greedy decoding is not always effective for hidden

text extraction. Due to their prevalence in LLM pre-

training data, some token sequences have such high

probabilities that even artiﬁcial embedding of hidden

text cannot distort them. In the case of the scenario

presented in Figure 4, during UTF, the LLM will fol-

low the token path “This is a great journey!” instead

of “This is a hidden message for you.” However, this

phenomenon occurs not due to artiﬁcial LLM distor-

tion introduced by UTFC, but due to the prevalence of

some token sequences in the pre-training data of the

LLM.

Large Language Models as Carriers of Hidden Messages

369

Figure 4: If a token sequence is highly popular in pre-

training data of LLM, it will result in a similar effect to that

of UTFC.

6 CONCLUSION

This work is the ﬁrst to propose a paradigm for ex-

tracting LLM ﬁngerprints without the need for infea-

sible trigger guessing. Our ﬁndings reveal that while

LLM ﬁngerprint might initially seem secure, it is sus-

ceptible to extraction via what we termed “Uncon-

ditional Token Forcing.” It can uncover hidden text

by exploiting the model’s response to speciﬁc tokens,

thereby revealing output sequences that exhibit un-

usually high token probabilities and other anomalous

characteristics.

Furthermore, we showed a modiﬁcation to the

ﬁne-tuning process designed to defend against UTF.

This defense strategy is based on the idea that the

LLM can be ﬁne-tuned to produce unrelated token

paths during UTF and attacks based on sampling de-

coding. Currently, no known extraction attack meth-

ods can reveal text hidden using the UTFC paradigm.

LIMITATIONS

While the proposed Unconditional Token Forcing

method effectively extracts hidden messages from

certain ﬁngerprinted LLMs, it does not generalize to

all models and ﬁngerprinting techniques. The success

of UTF depends on speciﬁc characteristics of the ﬁne-

tuning process and architecture of the model.

ETHICS STATEMENT

The presented methods have both beneﬁcial and po-

tentially harmful implications. On the one hand, the

proposed UTFC technique can enhance the robustness

of LLM ﬁngerprinting. On the other hand, the same

method can be used for LLM steganography, enabling

covert communication channels that could be used for

malign purposes. However, we believe it is better to

openly publish these methods and highlight the asso-

ciated security concerns so that the community can

develop solutions to address them.

REFERENCES

Bai, Y., Pei, G., Gu, J., Yang, Y., and Ma, X. (2024). Spe-

cial characters attack: Toward scalable training data

extraction from large language models. arXiv preprint

arXiv:2405.05990.

Carlini, N., Nasr, M., Hayase, J., Jagielski, M., Cooper,

A. F., Ippolito, D., Choquette-Choo, C. A., Wallace,

E., Tramer, F., and Lee, K. (2023). Scalable extrac-

tion of training data from (production) language mod-

els. arXiv preprint arXiv:2311.17035.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-

Voss, A., Lee, K., Roberts, A., Brown, T., Song, D.,

Erlingsson, U., et al. (2021). Extracting training data

from large language models. In 30th USENIX Security

Symposium (USENIX Security 21), pages 2633–2650.

Chowdhury, A. G., Islam, M. M., Kumar, V., Shezan,

F. H., Kumar, V., Jain, V., and Chadha, A. (2024).

Breaking down the defenses: A comparative survey

of attacks on large language models. arXiv preprint

arXiv:2403.04786.

Cui, J., Xu, Y., Huang, Z., Zhou, S., Jiao, J., and Zhang, J.

(2024). Recent advances in attack and defense ap-

proaches of large language models. arXiv preprint

arXiv:2409.03274.

Das, B. C., Amini, M. H., and Wu, Y. (2024a). Effec-

tive prompt extraction from language models. arXiv

preprint arXiv:2307.06865.

Das, B. C., Amini, M. H., and Wu, Y. (2024b). Security

and privacy challenges of large language models: A

survey. arXiv preprint arXiv:2402.00888.

Fairoze, J., Garg, S., Jha, S., Mahloujifar, S., Mahmoody,

M., and Wang, M. (2023). Publicly-detectable wa-

termarking for language models. Cryptology ePrint

Archive, Paper 2023/1661. https://eprint.iacr.org/

2023/1661.

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I.,

and Goldstein, T. (2023). A watermark for large lan-

guage models. In Krause, A., Brunskill, E., Cho,

K., Engelhardt, B., Sabato, S., and Scarlett, J., ed-

itors, Proceedings of the 40th International Confer-

ence on Machine Learning, volume 202 of Proceed-

ings of Machine Learning Research, pages 17061–

17084. PMLR.

Li, P., Cheng, P., Li, F., Du, W., Zhao, H., and Liu, G.

(2023). Plmmark: A secure and robust black-box wa-

termarking framework for pre-trained language mod-

els. Proceedings of the AAAI Conference on Artiﬁcial

Intelligence, 37(12):14991–14999.

Liang, Y., Xiao, J., Gana, W., and Yu, P. S. (2024). Wa-

termarking techniques for large language models: A

survey. arXiv preprint arXiv:2409.00089.

Mozes, M., Kleinberg, X. H. B., and Grifﬁn, L. D. (2023).

Use of LLMs for illicit purposes: Threats, preven-

tion measures, and vulnerabilities. arXiv preprint

arXiv:2308.12833.

Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper,

A. F., Ippolito, D., Choquette-Choo, C. A., Wallace,

E., Tram

er, F., and Lee, K. (2023). Scalable extrac-

SECRYPT 2025 - 22nd International Conference on Security and Cryptography

370

tion of training data from (production) language mod-

els. arXiv preprint arXiv:2311.17035.

Open Worldwide Application Security Project (OWASP)

(2024). OWASP Top 10 for Large Language Model

Applications. https://genai.owasp.org. [Online; Ac-

cess: 12.09.2024].

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins,

T., Chen, D., and Zettlemoyer, L. (2024). Detecting

pretraining data from large language models. In The

Twelfth International Conference on Learning Repre-

sentations.

Staab, R., Vero, M., Balunovic, M., and Vechev, M. (2024).

Beyond memorization: Violating privacy via infer-

ence in large language models. In The Twelfth Inter-

national Conference on Learning Representations.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,

A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P.,

Bhosale, S., and et al. (2023). Llama 2: Open founda-

tion and ﬁne-tuned chat models.

Wang, Y., Song, R., Zhang, R., Liu, J., and Li, L. (2024).

Llsm: Generative linguistic steganography with large

language model. arXiv preprint arXiv:2401.15656.

Wu, J., Wu, Z., Xue, Y., Wen, J., and Peng, W. (2024). Gen-

erative text steganography with large language model.

arXiv preprint arXiv:2404.10229.

Xu, J., Wang, F., Ma, M., Koh, P. W., Xiao, C., and Chen,

M. (2024). Instructional ﬁngerprinting of large lan-

guage models. In Duh, K., Gomez, H., and Bethard,

S., editors, Proceedings of the 2024 Conference of

the North American Chapter of the Association for

Computational Linguistics: Human Language Tech-

nologies (Volume 1: Long Papers), pages 3277–3306,

Mexico City, Mexico. Association for Computational

Linguistics.

Ziegler, Z., Deng, Y., and Rush, A. (2019). Neural lin-

guistic steganography. In Inui, K., Jiang, J., Ng, V.,

and Wan, X., editors, Proceedings of the 2019 Con-

ference on Empirical Methods in Natural Language

Processing and the 9th International Joint Conference

on Natural Language Processing (EMNLP-IJCNLP),

pages 1210–1215, Hong Kong, China. Association for

Computational Linguistics.

Large Language Models as Carriers of Hidden Messages

371