Can We Use Probing to Better Understand Fine-Tuning
and Knowledge Distillation of the BERT NLU?
Jakub Hościłowicz¹,², Marcin Sowański¹,², Piotr Czubowski¹ and Artur Janicki²
¹Samsung R&D Poland Institute, Warsaw, Poland
²Warsaw University of Technology, Warsaw, Poland
Keywords:
Probing, Natural Language Understanding, Dialogue Agents, BERT, Fine-Tuning, Knowledge Distillation.
Abstract:
In this article, we use probing to investigate phenomena that occur during fine-tuning and knowledge distillation
of a BERT-based natural language understanding (NLU) model. Our ultimate purpose was to use probing to
better understand practical production problems and consequently to build better NLU models. We designed
experiments to see how fine-tuning changes the linguistic capabilities of BERT, what the optimal size of
the fine-tuning dataset is, and what amount of information is contained in a distilled NLU based on a tiny
Transformer. The results of the experiments show that the probing paradigm in its current form is not well
suited to answer such questions. Structural, Edge and Conditional probes do not take into account how easy it
is to decode probed information. Consequently, we conclude that quantification of information decodability is
critical for many practical applications of the probing paradigm.
1 INTRODUCTION
In recent years significant progress has been made in
the natural language processing (NLP) field. Foun-
dation models, such as BERT (Devlin et al., 2019)
and GPT-3 (Brown et al., 2020), remarkably pushed
numerous crucial NLP benchmarks and the quality
of dialogue systems. Current efforts in the field of-
ten assume that progress in foundation models can
be made by either increasing the size of a model and
dataset or by introducing new training tasks. While we
acknowledge that this approach can lead to improve-
ments in many downstream NLP tasks (including natural language understanding, NLU), we argue that
current model evaluation procedures are insufficient
to determine whether neural networks “understand the
meaning” or whether they simply memorize training
data. Knowing this would allow us to build better
NLU models.
In the last few decades, measuring the knowledge
of neural networks has been one of the key challenges
for NLP. Perhaps the most straightforward way of
doing this is to ask models commonsense questions
and measure how often their answer is correct (Bisk
et al., 2020). However, as noted in Bender and Koller
(2020), such an approach might be insufficient to prove
whether a neural network “understands meaning” be-
cause the answers given by foundation models might
be the result of memorization of training data and
statistical patterns. And since this approach is not con-
clusive, a new set of methods that focus on explaining
the internal workings of neural networks has been pro-
posed (Limisiewicz and Marecek, 2020; Clark et al.,
2019). Among the most prominent are probing mod-
els, which are usually used to estimate the amount of
the various types of linguistic information contained
in neural network layers (Hewitt and Manning, 2019;
Tenney et al., 2019). In the probing paradigm, a simple neural network is applied to a frozen model to solve an auxiliary task (e.g., part-of-speech tagging).
The assumption is that the higher the performance of
such a probing neural network, the higher the amount
of the respective type of information a probed model
contains.
Many efforts have been made to better understand
the probing paradigm, its purposes, and its usefulness
in generating valuable insights from studies on neural
networks. Nevertheless, it is still unclear how exactly
probing should be used in the NLP field and what con-
clusions can be drawn from probing results (Ivanova
et al., 2021). Probing methods are inspired by neu-
roscience, where an analogous approach is used to better understand how information is processed in the human brain (Glaser et al., 2020). Most impor-
tantly, Kriegeskorte and Douglas (2019) conclude that
a decoding probe model can only reveal whether a par-
ticular type of information is present in a certain brain
region. The information perspective is also present in NLP: Pimentel et al. (2020) state that one should always
select the highest-performing probe, because it gives
the best estimation of the lower bound of information
amount in a neural network.
Our research is motivated by the questions we
faced while working on NLU models used in dialogue
agents. The development of production-ready BERT-based NLU models requires a deep understanding of the phenomena that occur during fine-tuning and knowledge distillation (KD). Especially important issues are “catastrophic forgetting” and the choice of the optimal fine-tuning dataset size. Probing seems to be a relevant tool to
analyze these issues, but in this work we will show
that its practical usability in the NLU context is low.
Our article is structured as follows. First, in Sec-
tion 2, we present a review of related work in the area.
Next, in Sections 3 and 4, we describe the design and
results of our experiments. In Section 5, we discuss
the results of our experiments in the context of related
works from both machine learning and neuroscience.
Finally, in Section 6, we summarize our arguments
and outline a proposed solution.
2 RELATED WORK
In the literature, three types of probing are usually
described. Structural probing, introduced by Hewitt and Manning (2019), conceptually aims at estimating the amount of syntactic information through the reconstruction of dependency trees from the vector representations returned by neural networks. Edge prob-
ing (Tenney et al., 2019) includes a wide array of sub-
sentence NLP tasks. All the tasks can be formulated
as syntactic or semantic relations between words or
sentences. Conditional probing (Hewitt et al., 2021)
is a framework which can be applied to any probing
method (e.g., conditional structural probing). Concep-
tually, conditional probing tells how much information
is present in a given layer of a model, provided that
this information is not present in a chosen baseline
(e.g., embedding layer). Consequently, conditional
probing allows for better comparison of models with
different architectures – it grounds probing results in
non-contextual embedding layers of models.
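To illustrate the idea, the following is a minimal sketch of how a conditional probing score can be computed as a gain over the baseline. It is our paraphrase of the concept, not the original implementation, and the function and variable names are ours.

```python
import numpy as np

# Minimal sketch (our paraphrase, not the original code): the conditional
# probing score is the gain obtained by adding a layer's representation on
# top of the baseline (non-contextual embedding) features.
def conditional_probe_score(train_and_eval_probe, baseline_feats, layer_feats, labels):
    # train_and_eval_probe(features, labels) -> held-out probing metric,
    # e.g., UUAS for structural probing or F1 for edge probing
    score_baseline = train_and_eval_probe(baseline_feats, labels)
    score_combined = train_and_eval_probe(
        np.concatenate([baseline_feats, layer_feats], axis=-1), labels)
    return score_combined - score_baseline
```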
Some authors used probing to better understand
how linguistic information is acquired by models and
how it changes during fine-tuning (Kirkpatrick et al.,
2017; Hu et al., 2020). Durrani et al. (2021) and Pérez-
Mayos et al. (2021) analyze the phenomenon of “catas-
trophic forgetting” and generalization in the NLU con-
text. Zhang et al. (2021) and Pérez-Mayos et al. (2021)
used probing as a tool to estimate the optimal size of
pretraining data for NLU tasks. Zhu et al. (2022) con-
cluded that probing is an efficient tool for that, because
it does not require gradient updates of the entire model.
Probing has also been used to compare amounts of
linguistic knowledge in neural networks (Nikoulina
et al., 2021; Liu et al., 2019).
The majority of publications related to BERT fine-
tuning report a drop in probing results after fine-
tuning. Durrani et al. (2021) conclude that the decrease
in probing results means that fine-tuning leads to catas-
trophic forgetting. Mosbach et al. (2020) hypothesize
that after fine-tuning, linguistic information might be
less linearly separable and hence harder to detect for a
probing model. The broad fine-tuning analysis of Mer-
chant et al. (2020) (including probing models) guides
authors towards the hypothesis that fine-tuning does
not introduce arbitrary (negative) changes to a model’s
representation, but mainly adjusts it to a downstream
task. Relying on improvements and new insights in the
probing field (e.g., conditional probing), we wanted to
continue along this path in order to understand what
happens in our particular NLU scenario.
Probing in the KD context was used by Kuncoro
et al. (2019). The authors concluded that their new
neural network architecture has better syntactic com-
petencies than the baseline. Similarly, Fei et al. (2020)
used probing to assess the language competencies of
a distilled student model. The fact that the probing
results of the student model are close to those of the teacher leads the authors to the conclusion that the
student is effective at capturing syntax. In a similar
manner, we wanted to use probing to measure the lin-
guistic capabilities of a distilled NLU model.
3 METHODS AND MODELS
To better illustrate our interpretation of probing, we in-
troduced the following terminology (see also Figure 1):
we define the Primary Decoder as the decoder that was
used to train or fine-tune a given Encoder (e.g., BERT).
For example, in the case of our BERT fine-tuning, it is
the NLU head. The Secondary Decoder is an external
decoder that was not used for training or fine-tuning
of the Encoder (e.g., a probing decoder). All probing results reported in this publication are computed on the last layer of the Encoder (e.g., the last layer of BERT).
[Figure 1 diagram: the “Architecture” panel shows BERT or a Tiny Transformer as the Encoder, the NLU layer as the Primary Decoder, and the probe as the Secondary Decoder; the “Environment” panel shows the pipeline Base BERT NLU (33% accuracy) -> fine-tuning -> BERT NLU (99%) -> knowledge distillation -> Student NLU (94%); the “Probing Interpretation” panel has axes for information decodability (X), amount of information (Y), and type of information (Z), where a high probing result indicates that the model has a significant amount of the given type of information, while a low or medium result admits two hypotheses: the information is absent, or it is present but hard to decode.]
Figure 1: Probing interpretation in a typical NLU scenario. The “Architecture” sub-graph describes the modeling design and terminology. “Environment” gives an overview of the pipeline. Finally, the “Probing Interpretation” sub-graph describes how probing can be interpreted and what limitations we see.
3.1 Probing Methods Used
We investigated three distinct types of probing meth-
ods: structural, edge and conditional. The performance
of the structural probing model is measured with the Unlabeled Undirected Attachment Score (UUAS), which represents the percentage of undirected edges placed correctly with respect to the gold syntactic tree. We used the part-of-speech (POS) tagging-based variant of edge probing; its metric is the F1 score. As a baseline for
conditional probing, we used the non-contextual em-
bedding layer of the probed neural network, similarly
to Hewitt et al. (2021). The metric for conditional
probing is determined by the chosen probing method,
e.g., for conditional structural probing it is UUAS.
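For reference, a minimal sketch of the UUAS computation follows; the edge representation (pairs of token indices) is our assumption, not part of the original tooling.

```python
# Minimal sketch (our assumption about edge representation): UUAS is the
# fraction of gold undirected dependency edges that the probe also predicts.
def uuas(gold_edges, pred_edges):
    gold = {frozenset(e) for e in gold_edges}   # undirected edges: {head, dependent}
    pred = {frozenset(e) for e in pred_edges}
    return len(gold & pred) / len(gold) if gold else 0.0

# Example: 3 of 4 gold edges are recovered, so UUAS = 0.75.
print(uuas([(0, 1), (1, 2), (2, 3), (1, 4)],
           [(0, 1), (2, 1), (2, 3), (0, 4)]))
```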
3.2 Joint NLU
There are many NLP tasks that are considered part of
the NLU field. In this paper, we define NLU in the way
which is most popular in the dialogue agent context –
as the intent classification and slot-filling tasks (Weld
et al., 2021). An intent represents a command uttered
to a dialogue system and slots are parameters of that
command. For example, in the utterance “play Radio-
head on Spotify”, ‘Radiohead’ and ‘Spotify’ are slots
and ‘to_play_music’ is an intent.
As an NLU architecture, we used Joint
BERT (Chen et al., 2019). This NLU model is
created by extending BERT with two softmax classi-
fiers corresponding to intents and slots respectively.
To measure NLU quality, we used semantic frame accuracy. It is calculated as the fraction of test sentences for which both the intent and all slots were correctly predicted.
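A minimal sketch of this metric follows; the data structure (per-sentence dictionaries of predictions and gold labels) is our assumption.

```python
# Minimal sketch (assumed data structure): a sentence counts as correct only
# if the predicted intent and the full predicted slot sequence match the gold.
def semantic_frame_accuracy(examples):
    correct = sum(
        1 for ex in examples
        if ex["intent_pred"] == ex["intent_gold"]
        and ex["slots_pred"] == ex["slots_gold"]
    )
    return correct / len(examples) if examples else 0.0
```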
We tested three variants of the NLU architecture, differentiated by the way in which BERT output vectors are passed to the intent classification (IC) layer (see the sketch after this list):
• Pooled IC, where the input to IC is the vector corresponding to BERT's special [CLS] token; this is the approach used in (Chen et al., 2019);
• Average IC, where the input to IC is the average of BERT's output vectors;
• Sum IC, where the input to IC is the sum of BERT's output vectors.
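The following is a minimal sketch of the three IC inputs, assuming a PyTorch-style BERT output tensor; the function name and tensor shapes are ours, not the original implementation.

```python
import torch

# hidden: [batch, seq_len, dim] BERT output vectors; mask: [batch, seq_len]
# attention mask. Minimal sketch of the three IC input variants we compare.
def ic_input(hidden, mask, variant="pooled"):
    if variant == "pooled":                          # vector of the [CLS] token
        return hidden[:, 0, :]
    mask = mask.unsqueeze(-1).float()                # ignore padding positions
    summed = (hidden * mask).sum(dim=1)
    if variant == "sum":
        return summed
    if variant == "average":
        return summed / mask.sum(dim=1).clamp(min=1.0)
    raise ValueError(f"unknown variant: {variant}")
```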
3.3 Knowledge Distillation for Joint
NLU
The purpose of KD is to train the student model through imitation of the teacher model. The objective is to minimize the KD loss L_KD, which measures the divergence between the student, the training data, and the teacher. Because Joint NLU (Chen et al., 2019) is based on a multitask paradigm, the divergence losses of the two tasks, intent classification (IC) and slot filling (SC), are minimized simultaneously (L_KD = L_IC + L_SC). L_IC is analogous to:

L_SC = H(y, σ(z_s)) + H(σ(z_t; ρ), σ(z_s; ρ))
where H is the cross-entropy loss function, y are the ground-truth slot labels, and z_s and z_t are the student and teacher logits, respectively. Additionally, the softmax function σ, parametrized by the temperature ρ, is applied to the logits.
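The per-task loss can be sketched as follows. This is our assumed PyTorch formulation of the formula above, not the released training code; the function name is ours, and the default temperature follows the best value reported in Section 4.2.

```python
import torch.nn.functional as F

# Minimal sketch (our assumption, not the released code) of the per-task KD
# loss: hard-label cross entropy plus cross entropy between the teacher's and
# the student's temperature-softened distributions.
def kd_task_loss(student_logits, teacher_logits, labels, rho=0.01):
    hard = F.cross_entropy(student_logits, labels)                 # H(y, sigma(z_s))
    teacher_soft = F.softmax(teacher_logits / rho, dim=-1)         # sigma(z_t; rho)
    student_log_soft = F.log_softmax(student_logits / rho, dim=-1)
    soft = -(teacher_soft * student_log_soft).sum(dim=-1).mean()   # H(sigma(z_t; rho), sigma(z_s; rho))
    return hard + soft

# L_KD = L_IC + L_SC: apply the same loss to the intent logits and to the
# per-token slot logits (flattened over tokens), then sum the two terms.
```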
Distillation was performed into a randomly initialized student model with a Transformer architecture analogous to BERT, but reduced to two layers and two attention heads. BERT NLU (Pooled) was used as the teacher model. As a point of reference, we also included the results of a tiny model with the same architecture as the student, but trained without distillation (Tiny Transformer NLU).
3.4 Data
We used a subset of Leyzer (Sowański and Janicki,
2020) that consisted of five domains with 51 intents
and 29 slots. We annotated Leyzer with the Stan-
ford Dependency Parser (Chen and Manning, 2014)
so that it could be used in the probing context. Addi-
tionally, for probing analysis, we used the Universal
Dependencies (UD) corpus (Nivre et al., 2020), which
is manually annotated and consists of general-type
sentences unrelated to NLU. The Leyzer subset we used contains 5,200 sentences and the UD corpus consists of 16,621 sentences in total. We used an 80%/10%/10% train/test/validation split. To better measure the generalization power of models, we manually constructed a small Malicious NLU test set (164 test cases). It is based on the Leyzer test set, but consists of sentences containing named entities and grammatical constructs not present in the training set.
4 EXPERIMENTAL RESULTS
4.1 Fine-Tuning Analysis
To get a better perspective on the analyzed phenomena,
we evaluated the NLU in two variants. In the case of BERT NLU, the BERT model is fine-tuned together with the NLU head. In the Frozen BERT variant, BERT's weights were fixed during NLU decoder training. Consequently, Frozen BERT gives baseline results (BERT is not fine-tuned). We report probing results on both Leyzer and UD to show how in-domain (Leyzer) probing results differ from out-of-domain (UD) results.
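A minimal sketch of the Frozen BERT setup, assuming the Hugging Face transformers API; the checkpoint name and the optimizer line are illustrative, not a description of our exact training configuration.

```python
from transformers import BertModel

# Minimal sketch (assumed setup): exclude the encoder from gradient updates so
# that only the NLU decoder (intent and slot classifiers) is trained.
encoder = BertModel.from_pretrained("bert-base-uncased")  # illustrative checkpoint
for param in encoder.parameters():
    param.requires_grad = False   # keep the pre-trained weights fixed
# Only decoder parameters are passed to the optimizer, e.g.:
# optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
```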
Detailed results presented in Table 1 show that,
as expected, fine-tuning improved NLU accuracy dur-
ing the course of training. An inverse trend can be
observed in relation to probing results. UUAS gets
lower as fine-tuning progresses. In the end, the fine-tuned NLU model's probing results were significantly lower than those of Frozen BERT. As presented in Figure 2,
exactly the same tendencies were observed when we
gradually increased the size of the fine-tuning dataset.
Table 1: Results of structural probing on the BERT NLU model tested on the Leyzer and Universal Dependencies (UD) test sets. Linear baseline based on Hewitt and Manning (2019).

Model Variant     | Epoch | NLU Accuracy | Structural probing (Leyzer) | Structural probing (UD)
Linear baseline   |   -   |      -       |              -              |          0.49
Frozen BERT       |  100  |     0.33     |            0.82             |          0.66
BERT NLU (Pooled) |   1   |     0.0      |            0.79             |          0.60
                  |   5   |     0.45     |            0.75             |          0.56
                  |  10   |     0.85     |            0.70             |          0.55
                  |  30   |     0.97     |            0.55             |          0.52
                  |  60   |     0.97     |            0.58             |          0.52
                  |  100  |     0.97     |            0.50             |          0.50
In Figure 3, we present the results of the exper-
iment with three different NLU architectures (as de-
scribed in subsection 3.2). At the end of the fine-tuning
process, the NLU results were nearly the same, while
the UUAS differed significantly for each architecture.
A strong downward trend is visible in all cases. All tendencies described in this section were the same in the case of conditional probing; see the example in Table 2.
Figure 2: Influence of dataset size on NLU accuracy and
probing results.
4.2 Probing Knowledge Distillation
The first observation drawn from the NLU results pre-
sented in Table 2 is that both Student and Tiny Trans-
former have lower generalization power than BERT
NLU. If we compare Student NLU to its non-distilled
version, we observe a non-negligible test accuracy gain resulting from the KD process. In our case, the temperature was a key factor for distillation; the best NLU result is achieved for the student distilled with T = 0.01 (presented in Table 2). Neither the temperature nor the choice of teacher significantly influences the probing results of the Student NLU.
Figure 3: Structural and edge probing results in the fine-tuning scenario on the Leyzer and UD datasets, for three variants of the NLU architecture.
In Table 2 we focused only on conditional prob-
ing because it gives a more valuable comparison of
models with different architectures; see also Section 2.
Both variants of conditional probes indicate that BERT
NLU, Student NLU, and Tiny Transformer NLU have
a near-zero amount of syntactic and part-of-speech
information in the last layers of their Encoders. Specif-
ically, if we measure only information not available
in the respective non-contextual baselines, then the
amount of information in the last Encoders’ layers for
all mentioned models is the same and close to zero.
We refrain from comparing the teacher and student on the UD corpus because our student is not pre-trained on any general corpus. Nonetheless, the UD results give a scale for the conditional probing results. For Frozen BERT, the result on the UD corpus is 0.15 for conditional structural probing and 0.11 for conditional edge probing.
Table 2: Results for knowledge distillation and probing.

Model                | NLU acc. | Malicious NLU acc. | Conditional Structural Probing (Leyzer) | Conditional Edge Probing (Leyzer)
Frozen BERT          |   0.33   |        0.10        |                  0.05                   |               0.07
BERT NLU (Pooled)    |   0.97   |        0.56        |                  0.02                   |               0.01
Student NLU          |   0.94   |        0.27        |                  0.01                   |               0.01
Tiny Transformer NLU |   0.90   |        0.23        |                  0.01                   |               0.01
5 DISCUSSION
The significant decrease in probing results on a general
UD corpus can be explained by a known phenomenon:
when we fine-tune BERT on an NLU task, its general
linguistic knowledge is destroyed (“catastrophic for-
getting”). However, the decrease in probing results on
the Leyzer corpus is not so easy to interpret. BERT’s
fine-tuning on Leyzer leads to substantial deterioration
of probing results on precisely the same dataset. We
initially presumed that fine-tuning on Leyzer would
increase related linguistic knowledge, which would
be reflected in improvement in the probing results.
However, both experiments suggest that fine-tuning
optimizes downstream task accuracy, which comes
at the expense of the linguistic knowledge associated
with Leyzer.
The experiments with different NLU decoder ar-
chitectures show significant changes in probing results,
while at the same time NLU accuracy was not affected
that much. Depending on the NLU decoder mode
(pooled, average, sum), structural and edge probing results can vary from around 0.45 to 0.70 UUAS and from 65% to 85% F1 score, respectively. This observation suggests that
the NLU decoder can play an important role in how
linguistic information in the Encoder is structured.
In the conditional framework, the downward prob-
ing tendencies of BERT’s fine-tuning do not change.
Conditional results on Leyzer are nearly the same for
teacher and non-pretrained student NLUs. Both mod-
els achieve near-zero conditional probing results and
if we apply a standard interpretation, we can conclude
that neither contains a significant amount of syntac-
tic and part-of-speech information in the last layers
of their Encoders (compared to the respective non-
contextual baselines).
All mentioned observations guided us toward the
two main hypotheses presented in “Amount of Infor-
mation” in Figure 1. The deterioration of probing
results might mean either that the amount of linguistic
information decreased or that it is harder to decode for
a probing model.
The purpose of the Primary NLU Decoder is to structure BERT's knowledge so that the decoder can effectively extract it and achieve high accuracy on a downstream
task. The Primary Decoder does not know about the
existence of Secondary Decoders and its goal is not
to make information in the Encoder easily extractable
for them, but to reduce the NLU training loss. Consequently, drawing definitive conclusions (“fine-tuning leads to catastrophic forgetting”) from probing results alone could be misleading. The degree of “catas-
trophic forgetting” (as measured with probing) could
be much smaller than we think, or even non-existent.
However, as shown in Figure 1, without inclusion of
the decodability issue in the probing paradigm, such
considerations remain inconclusive. Low probing re-
sults can mean either that information is not present,
or that it is hard to decode.
6 CONCLUSIONS AND FUTURE
WORK
Relying on insights from the experiments, neuro-
science (Kriegeskorte and Douglas, 2019; Ivanova
et al., 2021) and NLP, we conclude that probing has low usability for the analysis of NLU models. Current probing methods do not consider how easy it is to decode probed information; hence, they only give an estimate of the lower bound on the amount of information (Pimentel et al., 2020). Future research is necessary to
either quantify information decodability (as visualized
in Figure 1) or to design new probing methods that are
not susceptible to this issue.
However, to measure decodability, information in neural networks must first be defined in a rigorous way. Ultimately, similarly to (Tschannen et al., 2020),
we conclude that a new concept of information is im-
portant for the future of NLP research. One approach
to how information can be defined in neural networks
is presented by Xu et al. (2020), but we have not yet
found any works about information decodability.
Another path is to use probing in a neuroscience
manner, as proposed by Kriegeskorte and Douglas
(2019). Instead of trying to answer the question “How
much information of a given type is in the neural net-
work?”, we can focus on the question “Is a given type
of information present in the neural network?” Con-
sequently, we use probing results as a component of a p-value for the hypothesis that there is no significant amount of a given type of information in a given neu-
ral network layer. This path implies that we should
focus more on new types of information which are
less obvious than those currently probed. Information
decodability also constitutes an issue for this approach.
However, in our opinion, with proper design of hypoth-
esis testing (including heuristics about decodability of
information), this approach can give more reasonable
insights than the “How much information” paradigm.
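As an illustration of this direction (our own sketch, not a method evaluated in this paper), a probing result could be turned into a p-value with a label-permutation test; the function name and parameters are ours.

```python
import numpy as np

# Minimal sketch (our illustration): p-value for the null hypothesis "no
# significant amount of this type of information is present", estimated by
# comparing the observed probing score against scores obtained on shuffled labels.
def probing_p_value(train_and_eval_probe, feats, labels, n_permutations=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = train_and_eval_probe(feats, labels)
    null_scores = [
        train_and_eval_probe(feats, rng.permutation(labels))  # destroy the feature-label link
        for _ in range(n_permutations)
    ]
    return (1 + sum(s >= observed for s in null_scores)) / (1 + n_permutations)
```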
To summarize, the main contributions of our work
are as follows:
• We showed that current probing methods are of low usability for the analysis of NLU models. Without careful interpretation, they might lead to wrong conclusions.
• We presented a clear interpretation of probing in the NLP context. This interpretation implies that information decodability is a large obstacle for many practical applications of probing methods.
REFERENCES
Bender, E. M. and Koller, A. (2020). Climbing towards
NLU: On meaning, form, and understanding in the age
of data. In Proc. 58th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 5185–5198,
Online. ACL.
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. (2020). PIQA:
Reasoning about physical commonsense in natural lan-
guage. In Proc. AAAI conference on artificial intelli-
gence, volume 34, pages 7432–7439.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-shot
learners. Advances in neural information processing
systems, 33:1877–1901.
Chen, D. and Manning, C. (2014). A fast and accurate de-
pendency parser using neural networks. In Proc. Con-
ference on Empirical Methods in Natural Language
Processing (EMNLP), pages 740–750, Doha, Qatar.
ACL.
Chen, Q., Zhuo, Z., and Wang, W. (2019). BERT for joint
intent classification and slot filling. arXiv preprint
arXiv:1902.10909.
Clark, K., Khandelwal, U., Levy, O., and Manning, C. D.
(2019). What does BERT look at? An analysis of BERT's attention. In Linzen, T., Chrupala, G., Belinkov, Y.,
and Hupkes, D., editors, Proceedings of the 2019 ACL
Workshop BlackboxNLP: Analyzing and Interpreting
Neural Networks for NLP, BlackboxNLP@ACL 2019,
Florence, Italy, August 1, 2019, pages 276–286. Asso-
ciation for Computational Linguistics.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019).
BERT: pre-training of deep bidirectional transformers
for language understanding. In Burstein, J., Doran,
C., and Solorio, T., editors, Proceedings of the 2019
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, NAACL-HLT 2019, Minneapolis,
MN, USA, June 2-7, 2019, Volume 1 (Long and Short
Papers), pages 4171–4186. Association for Computa-
tional Linguistics.
Durrani, N., Sajjad, H., and Dalvi, F. (2021). How transfer
learning impacts linguistic knowledge in deep NLP
models? In Zong, C., Xia, F., Li, W., and Navigli, R.,
editors, Findings of the Association for Computational
Linguistics: ACL/IJCNLP 2021, Online Event, August
1-6, 2021, volume ACL/IJCNLP 2021 of Findings of
ACL, pages 4947–4957. Association for Computational
Linguistics.
Fei, H., Ren, Y., and Ji, D. (2020). Mimic and conquer: Het-
erogeneous tree structure distillation for syntactic NLP.
In Cohn, T., He, Y., and Liu, Y., editors, Findings of the
Association for Computational Linguistics: EMNLP
2020, Online Event, 16-20 November 2020, volume
EMNLP 2020 of Findings of ACL, pages 183–193.
Association for Computational Linguistics.
Glaser, J. I., Benjamin, A. S., Chowdhury, R. H., Perich,
M. G., Miller, L. E., and Kording, K. P. (2020). Ma-
chine learning for neural decoding. Eneuro, 7(4).
Hewitt, J., Ethayarajh, K., Liang, P., and Manning, C. D.
(2021). Conditional probing: measuring usable infor-
mation beyond a baseline. In Moens, M., Huang, X.,
Specia, L., and Yih, S. W., editors, Proceedings of
the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, EMNLP 2021, Virtual Event
/ Punta Cana, Dominican Republic, 7-11 November,
2021, pages 1626–1639. Association for Computa-
tional Linguistics.
Hewitt, J. and Manning, C. D. (2019). A structural probe for
finding syntax in word representations. In Proc. Con-
ference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language
Technologies, Vol. 1, pages 4129–4138.
Hu, J., Gauthier, J., Qian, P., Wilcox, E., and Levy, R. (2020).
A systematic assessment of syntactic generalization
in neural language models. In Jurafsky, D., Chai, J.,
Schluter, N., and Tetreault, J. R., editors, Proceed-
ings of the 58th Annual Meeting of the Association for
Computational Linguistics, ACL 2020, Online, July
5-10, 2020, pages 1725–1744. Association for Compu-
tational Linguistics.
Ivanova, A. A., Hewitt, J., and Zaslavsky, N. (2021). Probing
artificial neural networks: insights from neuroscience.
CoRR, abs/2104.08197.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Des-
jardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho,
T., Grabska-Barwinska, A., et al. (2017). Overcoming
catastrophic forgetting in neural networks. Proc. of the
National Academy of Sciences, 114(13):3521–3526.
Kriegeskorte, N. and Douglas, P. K. (2019). Interpreting
encoding and decoding models. Current opinion in
neurobiology, 55:167–179.
Kuncoro, A., Dyer, C., Rimell, L., Clark, S., and Blunsom,
P. (2019). Scalable syntax-aware language models
using knowledge distillation. In Korhonen, A., Traum,
D. R., and Màrquez, L., editors, Proceedings of the
57th Conference of the Association for Computational
Linguistics, ACL 2019, Florence, Italy, July 28- August
2, 2019, Volume 1: Long Papers, pages 3472–3484.
Association for Computational Linguistics.
Limisiewicz, T. and Marecek, D. (2020). Syntax represen-
tation in word embeddings and neural networks - A
survey. In Holena, M., Horváth, T., Kelemenová, A.,
Mráz, F., Pardubská, D., Plátek, M., and Sosík, P., edi-
tors, Proceedings of the 20th Conference Information
Technologies - Applications and Theory (ITAT 2020),
Hotel Tyrapol, Oravská Lesná, Slovakia, September
18-22, 2020, volume 2718 of CEUR Workshop Pro-
ceedings, pages 40–50. CEUR-WS.org.
Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and
Smith, N. A. (2019). Linguistic knowledge and trans-
ferability of contextual representations. In Burstein,
J., Doran, C., and Solorio, T., editors, Proceedings of
the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Hu-
man Language Technologies, NAACL-HLT 2019, Min-
neapolis, MN, USA, June 2-7, 2019, Volume 1 (Long
and Short Papers), pages 1073–1094. Association for
Computational Linguistics.
Merchant, A., Rahimtoroghi, E., Pavlick, E., and Tenney, I.
(2020). What happens to BERT embeddings during
fine-tuning? In Proceedings of the Third BlackboxNLP
Workshop on Analyzing and Interpreting Neural Net-
works for NLP, pages 33–44, Online. Association for
Computational Linguistics.
Mosbach, M., Khokhlova, A., Hedderich, M. A., and
Klakow, D. (2020). On the interplay between fine-
tuning and sentence-level probing for linguistic knowl-
edge in pre-trained transformers. In Alishahi, A., Be-
linkov, Y., Chrupala, G., Hupkes, D., Pinter, Y., and
Sajjad, H., editors, Proceedings of the Third Black-
boxNLP Workshop on Analyzing and Interpreting Neu-
ral Networks for NLP, BlackboxNLP@EMNLP 2020,
Online, November 2020, pages 68–82. Association for
Computational Linguistics.
Nikoulina, V., Tezekbayev, M., Kozhakhmet, N.,
Babazhanova, M., Gallé, M., and Assylbekov, Z.
(2021). The rediscovery hypothesis: Language models
need to meet linguistics. Journal of Artificial Intelli-
gence Research, 72:1343–1384.
Nivre, J., de Marneffe, M., Ginter, F., Hajic, J., Manning,
C. D., Pyysalo, S., Schuster, S., Tyers, F. M., and
Zeman, D. (2020). Universal dependencies v2: An
evergrowing multilingual treebank collection. In Cal-
zolari, N., Béchet, F., Blache, P., Choukri, K., Cieri,
C., Declerck, T., Goggi, S., Isahara, H., Maegaard,
B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., and
Piperidis, S., editors, Proceedings of The 12th Lan-
guage Resources and Evaluation Conference, LREC
2020, Marseille, France, May 11-16, 2020, pages 4034–
4043. European Language Resources Association.
Pérez-Mayos, L., Ballesteros, M., and Wanner, L. (2021).
How much pretraining data do language models need to
learn syntax? In Moens, M., Huang, X., Specia, L., and
Yih, S. W., editors, Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2021, Virtual Event / Punta Cana, Dominican
Republic, 7-11 November, 2021, pages 1571–1582. As-
sociation for Computational Linguistics.
Pimentel, T., Valvoda, J., Maudslay, R. H., Zmigrod, R.,
Williams, A., and Cotterell, R. (2020). Information-
theoretic probing for linguistic structure. In Jurafsky,
D., Chai, J., Schluter, N., and Tetreault, J. R., editors,
Proceedings of the 58th Annual Meeting of the Associa-
tion for Computational Linguistics, ACL 2020, Online,
July 5-10, 2020, pages 4609–4622. Association for
Computational Linguistics.
Sowański, M. and Janicki, A. (2020). Leyzer: A dataset for multilingual virtual assistants. In Sojka, P., Kopeček, I., Pala, K., and Horák, A., editors, Proc. Conference on Text, Speech, and Dialogue (TSD 2020), pages 477–486, Brno, Czechia. Springer International Publishing.
Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy,
R. T., Kim, N., Durme, B. V., Bowman, S. R., Das,
D., and Pavlick, E. (2019). What do you learn from
context? probing for sentence structure in contextual-
ized word representations. In 7th International Con-
ference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S.,
and Lucic, M. (2020). On mutual information maxi-
mization for representation learning. In 8th Interna-
tional Conference on Learning Representations, ICLR
2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open-
Review.net.
Weld, H., Huang, X., Long, S., Poon, J., and Han, S. C.
(2021). A survey of joint intent detection and slot-
filling models in natural language understanding. arXiv
preprint arXiv:2101.08091.
Xu, Y., Zhao, S., Song, J., Stewart, R., and Ermon, S.
(2020). A theory of usable information under com-
putational constraints. In 8th International Conference
on Learning Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020. OpenReview.net.
Zhang, Y., Warstadt, A., Li, X., and Bowman, S. R. (2021).
When do you need billions of words of pretraining
data? In Zong, C., Xia, F., Li, W., and Navigli, R.,
editors, Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language
Processing, ACL/IJCNLP 2021, (Volume 1: Long Pa-
pers), Virtual Event, August 1-6, 2021, pages 1112–
1125. Association for Computational Linguistics.
Zhu, Z., Shahtalebi, S., and Rudzicz, F. (2022). Predicting
fine-tuning performance with probing. arXiv preprint
arXiv:2210.07352.