Question Difficulty Decision for Automated Interview in Foreign
Language Test
Minghao Lin¹, Koichiro Ito¹ and Shigeki Matsubara¹,² (https://orcid.org/0000-0003-0416-3635)
¹Graduate School of Informatics, Nagoya University, Japan
²Information & Communications, Nagoya University, Japan
Keywords:
Question Generation, Computer-Assisted Language Assessment, Language Speaking Test, Oral Proficiency
Interview.
Abstract: As globalization continues to progress all over the world, demand is growing for objective and rapid assessment of language proficiency in foreign language learners. While automated assessment of listening, reading, and writing skills has been proposed, little research has been done to automate the assessment of speaking skills. In this paper, we propose a method of deciding the difficulty of questions generated for interview tests that assess skills in speaking a foreign language. To address question difficulty flexibly according to the abilities of test takers, our method considers the appropriateness of responses from test takers. We implemented this method using the large-scale pre-trained language model BERT (Bidirectional Encoder Representations from Transformers). Experiments were conducted using simulated test data from the Japanese Learner's Conversation Database to confirm the effectiveness of our method in deciding difficulty.
1 INTRODUCTION
The need to evaluate proficiency in a second language is increasing, as more and more people move abroad to study or work. The four main language skills that are assessed to determine second language proficiency are reading, writing, listening, and speaking (Powers, 2010). Automated assessment methods for the first three skills have been discussed in previous studies, including question generation and machine scoring (Huang et al., 2014; Du et al., 2017; Yannakoudakis et al., 2011).
Assessment of second language speaking remains an under-researched area due to its complexity. The evaluation process involves a vast amount of human labor, specialized equipment, and specialized environments. Some previous studies investigated speaking skills assessment to alleviate this burden and to evaluate the speaking skills of individuals in a second language (Litman et al., 2016).
According to (Bahari, 2021), computer-assisted language assessment studies are moving towards non-linear dynamic assessment, which focuses on the individual learner, by introducing interactive, dynamic, and adaptive strategies. However, previous studies in this area have hitherto focused on monologue-type speaking tests. This approach is insufficient for implementing a reliable test or for simulating a real-life human-human oral proficiency test environment.
Oral proficiency assessment in language testing includes the following two parts:

Test delivery: To acquire speech samples from a second language learner, two main types of tests have been proposed: monologue and interview. A monologue test is conducted with static question difficulty; the questions are created in advance and are not changed during the assessment. An interview test is a standardized, global assessment of functional speaking ability in the form of a conversation between the tester and the test taker. The questions posed in an interview test are dynamic.

Scoring: The analysis of the speech samples collected in the test allows the speaking skills of the test taker to be fully assessed. Different test strategies focus on a wide variety of aspects, including pronunciation and accuracy.
According to (Bernstein et al., 2010), testers expect certain elements of real-life communication to be represented in the test. Among oral proficiency tests, the interview test meets this expectation to the greatest degree. The interview test is always conducted dynamically, in that the tester does not always pose questions of the same difficulty.
In this paper, we propose a method of deciding the proper difficulty of questions in an interview test in response to the second language speaking skill of individual test takers. This method is built upon the large-scale pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019). BERT was proposed for natural language processing and has proven effective in sentiment analysis, question answering, document summarization, and other tasks. This motivates us to apply such a model to the difficulty decision task.

Our method uses the conversational context along with additional appropriateness information for the responses of test takers. We conducted experiments on a simulated oral proficiency interview test dataset to validate our method. The method outperformed the baseline models, confirming its validity.
The remainder of this paper is organized as follows: Section 2 reviews recent research on the automation of oral proficiency assessment. Section 3 describes the difficulty decision task and provides insight into the structural design of the proposed model. Section 4 presents a full account of the experimental setting and results. Finally, we provide a brief summary of our work and discuss our plans for future work.
2 RELATED WORK
Many studies put effort into the automation of oral
proficiency assessment with static question difficulty.
The samples that were used to judge speaking skills
of a test taker were collected as monologues. In
(Yoon and Lee, 2019), the authors collected around
one minute of spontaneous speech samples from test
takers, including readings and/or answers after lis-
tening to a passage. They used a Siamese convo-
lutional neural network (Mueller and Thyagarajan,
2016) to model the semantic relationship between the
key points generated by experts and test responses.
The neural network was also used to score the speak-
ing skill of each test taker. In (Zechner et al., 2014),
the authors collected restricted and semi-restricted
speech from test takers. The restricted speech involved reading and repeating a passage. In the semi-restricted speech, the test taker was required to provide sufficient remaining content to formulate a complete response corresponding to an image or chart. The
authors proposed a method that combines diverse as-
pects of the features of speaking proficiency using a
linear regression model to predict response scores.
The studies mentioned above use a strategy that
collects samples manually and then analyzes them us-
ing algorithms. Some previous studies also use ma-
chines to deliver tests and collect samples. In the Pearson Test of English (Longman, 2012), test takers are
requested to repeat sentences, answer short questions,
perform sentence builds, and retell passages to a ma-
chine. Their responses are analyzed by algorithms
(Bernstein et al., 2010). In (de Wet et al., 2009), the
authors designed a spoken dialogue system for test
takers to guide them and capture their answers. The
system involves reading tasks and repetition tasks.
The authors used automatic speech recognition to evaluate the speaking skills of test takers, focusing on fluency, pronunciation, and repetition accuracy.
Oral proficiency tests delivered in the monologue format have been used to evaluate the speaking skill of test takers and have shown a high degree of correlation with the interview test (Bernstein et al., 2010). However, automation of interview assessment with dynamic question difficulty has not been developed to the extent that automation of the monologue format has. This paper seeks to fill this gap.
3 METHOD
3.1 Problem Setting
In an interview test, the tester follows the strategy below:

First, the appropriateness of the responses of the test taker is estimated.

Second, based on this appropriateness measure, the difficulty of the question to be given next is decided.
As noted in (Kasper, 2006), if it is difficult for the test taker to respond to the given question, the tester would change the question as the next action. The questions target a specific oral proficiency level and the functions at that level (ACTFL, 2012). This means that an automated interview test should have the ability to adjust the difficulty level of its questions during the course of the interview. In this paper, we propose a method to decide the difficulty of the next question posed by the tester. Our method first estimates the appropriateness of the test taker's response and then chooses a suitable difficulty level for the next question.

Table 1: Example of an interview test dialogue. The appropriateness and the difficulty are shown below the responses and questions, respectively.

utterances (appropriateness / difficulty)
Q_1: What is your favorite sport? (easy)
R_1: My favorite sport is soccer. (appropriate)
Q_2: Could you please explain the rule of it? (difficult)
R_2: Uh, it's ... a game like ... I'm sorry, it's hard to tell. (inappropriate)
Q_3: OK, I see. Then ... Do you find this sport funny? (easy)
R_3: Oh yes, I enjoy playing it. (appropriate)
We denote the entire interview test dialogue with the symbol D. D contains the following two types of utterances: questions delivered by the tester and responses produced by the test taker. The i-th question in D is denoted as Q_i, while the i-th response in D is denoted as R_i. When there are n pairs of questions and responses in D, D is denoted as [Q_1, R_1, Q_2, R_2, ..., Q_n, R_n]. An example dialogue is shown in Table 1. The second question utterance is the question Q_2 (= "Could you please explain the rule of it?"), asking the test taker to explain the rules of the sport. The test taker finds it difficult to produce a response and makes the following inappropriate response: R_2 (= "Uh, it's ... a game like ... I'm sorry, it's hard to tell."). Here, the tester should reduce the difficulty level of the questions and provide an easy question next. Conversely, if the test taker were to give a response such as R_2 (= "Basically, it is a sport in which two teams of 11 players use a single ball and kick it into each other's goal."), the test taker has successfully answered the tester's question and given an appropriate response. The tester is then more likely to give a difficult question next, such as asking for the test taker's opinion on whether it is fair to the average student for athletes to be given preference in university admission.
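To make this problem setting concrete, the following minimal Python sketch represents the dialogue D as a list of question-response pairs and traces the two-step strategy described above. The helper functions estimate_appropriateness and decide_difficulty are hypothetical toy stand-ins for the models introduced in Section 3.2, included only so the sketch runs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    question: str   # Q_i, posed by the tester
    response: str   # R_i, produced by the test taker

# D = [Q_1, R_1, ..., Q_n, R_n], represented as a list of question-response pairs.
D: List[Turn] = [
    Turn("What is your favorite sport?", "My favorite sport is soccer."),
    Turn("Could you please explain the rule of it?",
         "Uh, it's ... a game like ... I'm sorry, it's hard to tell."),
]

def estimate_appropriateness(context: List[Turn]) -> bool:
    # Toy stand-in for the BERT appropriateness model of Section 3.2.1,
    # included only so that this sketch runs end to end.
    return "sorry" not in context[-1].response.lower()

def decide_difficulty(context: List[Turn], appropriate: bool) -> str:
    # Toy stand-in for the BERT difficulty model of Section 3.2.2:
    # an appropriate response leads to a more difficult next question.
    return "difficult" if appropriate else "easy"

def next_question_difficulty(dialogue: List[Turn]) -> str:
    context = dialogue[-2:]                           # [Q_{i-1}, R_{i-1}, Q_i, R_i]
    appropriate = estimate_appropriateness(context)   # step 1: appropriateness
    return decide_difficulty(context, appropriate)    # step 2: difficulty

print(next_question_difficulty(D))                    # -> "easy" for the dialogue above
```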
3.2 Model Structure

To perform question difficulty decision in an interview test, we propose a two-step method that uses BERT (Devlin et al., 2019), as shown in Figure 1. BERT is built upon the Transformer architecture (Vaswani et al., 2017). Some previous works have applied BERT to dialogue-related tasks (Wu et al., 2020; Song et al., 2021; Wang et al., 2021). Our method consists of two parts, namely, the response appropriateness estimation model and the question difficulty decision model. We define the two-step structure of our method as follows: given an interview test dialogue context, the method first estimates whether the given response is appropriate as a response to the given question. Then, the method uses the appropriateness estimation result and the dialogue context mentioned above to decide the difficulty of the next question.

Figure 1: Structure of our method. (The appropriateness model takes the target response, the previous responses, and the corresponding questions as input and outputs the target response appropriateness; the difficulty model then outputs the next question difficulty.)
3.2.1 Response Appropriateness Estimation
The main structure of our model is shown in Figure 2.
To estimate the appropriateness of response R_i, two pairs of questions and responses [Q_{i-1}, R_{i-1}, Q_i, R_i] are used as input for the model. Q_i is the question corresponding to response R_i, and [Q_{i-1}, R_{i-1}] serve as the context. The two pairs of questions and responses are input into the tokenizer, and [SEP] tokens are inserted at the end of each tokenized utterance. The input tokens of an utterance are denoted with the corresponding lowercase letter. For example, R_i is denoted as [r_i^1, r_i^2, ..., r_i^m], where r_i^j is the j-th token in R_i.
The input token embeddings are denoted as E. For example, E_{r_i^j} denotes the token embedding of token r_i^j, i.e., the j-th token in R_i. The two types of segment embeddings are denoted by E_A and E_B. E_i denotes the position embedding of the i-th token in the input token sequence. The last hidden states of all tokens are denoted as T. For example, T_{r_i^j} denotes the last hidden state of the token r_i^j. The representation of the [CLS] token, T_{[CLS]}, is used as input to the classifier. The classifier has a linear transformation layer with a softmax function, performing the response appropriateness estimation.
Figure 2: Structure of proposed method for response appropriateness estimation.
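The following is a minimal sketch of how such a classifier could be realized with the Hugging Face Transformers library; it is illustrative rather than our exact implementation. The checkpoint identifier cl-tohoku/bert-base-japanese corresponds to the Tohoku model used in Section 4.2 (it additionally requires the fugashi Japanese tokenizer dependency), the English example utterances stand in for the Japanese data, and the classification head here is randomly initialized, whereas in practice it is fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cl-tohoku/bert-base-japanese"   # assumed checkpoint, as in Section 4.2
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Two question-response pairs: [Q_{i-1}, R_{i-1}] as the context, [Q_i, R_i] as the target.
context = "What is your favorite sport? [SEP] My favorite sport is soccer."
target = "Could you please explain the rule of it? [SEP] Uh, it's ... hard to tell."

# BERT's sentence-pair input: context as segment A, target pair as segment B,
# with [SEP] tokens separating the utterances inside each segment.
inputs = tokenizer(context, target, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits            # linear classifier over T_[CLS]
probs = torch.softmax(logits, dim=-1)          # P(inappropriate), P(appropriate)
print("appropriate" if probs[0, 1] > probs[0, 0] else "inappropriate")
```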
3.2.2 Question Difficulty Decision
Within the interview test dialogue context, our method utilizes the corresponding response appropriateness estimation result to decide the difficulty of questions. The response appropriateness estimation result is represented by a special token from the set [[0], [1]], as it has been shown that special tokens can be useful in representing abstract concepts (Xie et al., 2021). Token [0] denotes an inappropriate response, while token [1] denotes an appropriate one. In the following, this special token is also referred to as the response appropriateness label.

To decide the difficulty of the next question Q_{i+1}, our method uses the two pairs of questions and responses [Q_{i-1}, R_{i-1}, Q_i, R_i] as input information. The appropriateness label of R_i is taken as additional information. With the exception of the additional appropriateness label and the input format in the sentence-pair configuration, the other parts of the model remain the same as in the appropriateness estimation model. The main structure of the difficulty decision model is shown in Figure 3.
We build our model using the label-fusing method proposed by (Xiong et al., 2021). Their method keeps the default sentence-pair input configuration of BERT but uses different segment embeddings for the labels and the context, without changing the original encoder structure. This means that the non-natural-language labels and the natural language context can be distinguished by the segment embeddings (E_A and E_B) and concatenated with a [SEP] token as input. This method, which utilizes a label embedding technique, can improve the performance of BERT in text classification while maintaining nearly the same computational cost. The appropriateness estimation information, which is represented as a special token, is added to the BERT vocabulary. The appropriateness label that serves as input is denoted by L.
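As an illustration of how the appropriateness label could be fused into the input, here is a hedged sketch using the Hugging Face Transformers library: the label tokens [0] and [1] are registered as special tokens, and the sentence-pair encoding assigns the label a segment embedding different from that of the dialogue context. This is our reading of the label-fusing idea and an approximation of the actual model, not its exact implementation; the checkpoint identifier and the English example text are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cl-tohoku/bert-base-japanese"   # assumed checkpoint, as in Section 4.2
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Register the appropriateness labels [0] (inappropriate) and [1] (appropriate)
# as special tokens and grow the embedding matrix accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": ["[0]", "[1]"]})
model.resize_token_embeddings(len(tokenizer))

context = ("What is your favorite sport? [SEP] My favorite sport is soccer. [SEP] "
           "Could you please explain the rule of it? [SEP] Uh, it's ... hard to tell.")
label_token = "[0]"    # estimated appropriateness of the target response R_i

# Sentence-pair encoding gives the context and the label different
# token_type_ids (segment embeddings E_A and E_B) and joins them with [SEP].
inputs = tokenizer(context, label_token, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits            # classifier over T_[CLS], untrained here
difficulty = "easy" if logits.argmax(-1).item() == 0 else "difficult"
print(difficulty)
```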
4 EXPERIMENTS
4.1 Dataset
To evaluate the effectiveness of our method, we performed experiments on simulated test data in the Japanese Learner's Conversation Database (JLCD, https://mmsrv.ninjal.ac.jp/kaiwa/DB-summary.html), published by the National Institute for Japanese Language. All of the simulated interview tests were performed according to the ACTFL Oral Proficiency Interview (OPI) standard, which assesses the ability to use language effectively and appropriately in real-life situations (ACTFL, 2012). After an assessment that takes around twenty minutes, test takers are rated on one of the following four proficiency levels: superior, advanced, intermediate, and novice. All interview contents are presented as spoken transcripts created manually.
In all, there are 390 transcripts in the dataset, of which we used 59. We split these data into approximately 8:1:1 for training, development, and testing. We divided the transcript data into utterances, yielding 3,251 question utterances and 3,251 response utterances. We manually annotated each utterance. The response utterances spoken by the test takers were marked as appropriate or inappropriate, with reference to both the response and the corresponding question. Table 2 presents the distribution of response appropriateness across the datasets.

Figure 3: Structure of proposed method for next question difficulty decision. Input token L represents the appropriateness label of the target response.

Table 2: Response appropriateness distribution.
label          train  dev  test
appropriate    2,069  233  292
inappropriate    491   73   93

Table 3: Question difficulty distribution.
label          train  dev  test
easy           1,829  228  287
difficult        731   78   98
In the ACTFL-OPI standard, novice- and intermediate-level questions involve simple inquiries, and test takers are asked to make binary choices or provide simple factual conclusions. Advanced-level questions involve describing an object in detail, retelling a story, and comparing two targets. Superior-level questions involve expressing personal opinions or views on social phenomena and abstract concepts. In this paper, we defined the decision on the difficulty of the next question as a binary classification task, i.e., whether the next generated question should be easy or difficult. Questions involving binary choices or simple factual conclusions were treated as easy questions. Questions involving describing objects in detail, presenting personal opinions, and the like were treated as difficult questions. The question utterances spoken by the testers were marked as easy or difficult. Table 3 shows the question difficulty distribution across the datasets.
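This annotation rule can be summarized as a simple mapping from question type to binary difficulty. The following sketch is a hypothetical helper that encodes the rule for illustration; the actual labels were assigned manually as described above.

```python
# Question types that map to the "easy" label (novice / intermediate level).
EASY_TYPES = {"binary choice", "simple factual conclusion"}

# Everything else (advanced / superior level) maps to "difficult":
# detailed description, retelling, comparison, personal opinion, abstract concept.
def difficulty_label(question_type: str) -> str:
    return "easy" if question_type in EASY_TYPES else "difficult"

print(difficulty_label("binary choice"))      # -> easy
print(difficulty_label("personal opinion"))   # -> difficult
```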
4.2 Model Settings
We implemented the response appropriateness estimation model and the question difficulty decision model by fine-tuning pre-trained BERT. We used the bert-base-japanese model published by Tohoku University (https://github.com/cl-tohoku/bert-japanese) for both the appropriateness estimation model and the difficulty decision model. Both models were set up as binary classifiers to perform appropriateness estimation and difficulty decision, respectively. The learning rate was 1e-5 for the appropriateness estimation model and 2e-7 for the difficulty decision model. The batch size was set to 16. We used AdamW (Loshchilov and Hutter, 2019) as the optimizer and set the weight decay to 0.01. The dropout rate was kept at 0.1. The maximum input token length was 512. We used cross-entropy loss for training.
The distribution of labels was imbalanced in both the response appropriateness estimation task and the question difficulty decision task. We calculated the weight of each class using Equation (1) and utilized the class weights while training our models. Here, W_j is the weight of class j, n_samples is the total number of samples in the data, n_classes is the total number of unique classes, and n_samples_j is the total number of samples in class j.

$W_j = \dfrac{n_{\mathrm{samples}}}{n_{\mathrm{classes}} \cdot n_{\mathrm{samples}_j}}$   (1)
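As a worked example of Equation (1), the following sketch computes the class weights from the training-set counts in Tables 2 and 3 and shows how such weights can be passed to a weighted cross-entropy loss in PyTorch. It reproduces the formula above but is not our exact training code.

```python
import torch
import torch.nn as nn

def class_weights(counts):
    """W_j = n_samples / (n_classes * n_samples_j)."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * c) for c in counts]

# Response appropriateness (train): 2,069 appropriate vs. 491 inappropriate.
print(class_weights([2069, 491]))   # ~ [0.619, 2.607]
# Question difficulty (train): 1,829 easy vs. 731 difficult.
print(class_weights([1829, 731]))   # ~ [0.700, 1.751]

# The weights enter the cross-entropy loss used during fine-tuning.
weights = torch.tensor(class_weights([1829, 731]), dtype=torch.float)
loss_fn = nn.CrossEntropyLoss(weight=weights)
```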
In the training of the question difficulty decision model, the manually annotated appropriateness labels were used for the responses. During testing, the labels estimated by the response appropriateness estimation model were used instead of the manually annotated labels. Therefore, the performance of the question difficulty decision model on the test data is affected by errors in the response appropriateness estimation model. This proposed method is denoted as [ours (noisy)]. Each model was trained for ten epochs on its corresponding task. The models were evaluated on the development set at the end of each epoch, and the checkpoints with the lowest evaluation loss were saved. We report the performance of those checkpoints on the test set. Training, evaluation, and testing of both models were done on a single NVIDIA RTX A5000 graphics card.
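The training loop can be pictured as follows. This is a self-contained schematic using the hyperparameters from this section; a tiny linear classifier and random tensors stand in for BERT and the JLCD batches so the sketch runs on its own.

```python
import copy
import torch
import torch.nn as nn
from torch.optim import AdamW

torch.manual_seed(0)
model = nn.Linear(8, 2)                # stand-in for BERT + classification head
loss_fn = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)  # 2e-7 for the difficulty model

# Random stand-ins for the training and development batches (batch size 16).
train = [(torch.randn(16, 8), torch.randint(0, 2, (16,))) for _ in range(4)]
dev = [(torch.randn(16, 8), torch.randint(0, 2, (16,)))]

best_dev_loss, best_state = float("inf"), None
for epoch in range(10):                # ten epochs per task
    model.train()
    for x, y in train:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    model.eval()                       # evaluate on the development set each epoch
    with torch.no_grad():
        dev_loss = sum(loss_fn(model(x), y).item() for x, y in dev)
    if dev_loss < best_dev_loss:       # keep the checkpoint with the lowest dev loss
        best_dev_loss, best_state = dev_loss, copy.deepcopy(model.state_dict())
```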
4.3 Baselines and Metrics
In the response appropriateness estimation task, we
implemented a random prediction baseline. The rate
of predicting that a target response was appropriate
was set to 0.809, based on the appropriate response
rate in our training data.
For the question difficulty decision task, we implemented the following four methods for comparison with the proposed method [ours (noisy)].

[random]: This method randomly predicts the difficulty of questions. The rate of predicting that the difficulty of the following question was easy was set to 0.714, based on the easy question rate in our training data.

[vanilla BERT]: This method only uses the dialogue context as input when making question difficulty decisions, which means that no appropriateness label is included. This method was developed to evaluate the validity of introducing appropriateness information.

[ours (noisy) w/o seg]: This method excludes the distinct segment embeddings from our model. The labels output by the response appropriateness model were used during testing. This method is used to verify the effectiveness of segment embeddings in distinguishing special tokens from the natural language context.

[ours (correct)]: This method uses the correct, manually annotated appropriateness labels for both training and testing. In this method, all labels serve as correct appropriateness information and are included in the input to the difficulty decision model. This method is intended to evaluate the performance of the difficulty decision model in the case where the appropriateness estimation result produced by the appropriateness estimation model is absolutely correct. Its performance serves as a reference value in the ideal environment.
Table 4: Performance of appropriateness estimation method
on the test set. All the results were macro-averaged.
Precision Recall F1
[random] 0.473 0.480 0.472
[ours] 0.665 0.653 0.658
Table 5: Performance of difficulty decision method on the
test set. All the results were macro-averaged.
Precision Recall F1
[random] 0.500 0.500 0.499
[vanilla BERT] 0.605 0.619 0.609
[ours (noisy) w/o seg] 0.626 0.638 0.631
[ours (noisy)] 0.629 0.648 0.635
[ours (correct)] 0.631 0.650 0.637
We report macro-precision, macro-recall, and macro-F1 for all of our experiments, since we are not focused on the performance of a single label; both labels are important.
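For reference, the macro-averaged scores can be computed as in the following sketch, assuming scikit-learn; the label vectors are toy values rather than our test data.

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # 1 = appropriate / easy, 0 = inappropriate / difficult
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# average="macro" weights both classes equally, regardless of their frequency.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
print(round(precision, 3), round(recall, 3), round(f1, 3))
```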
4.4 Experimental Results
We report the results of the evaluation of the appropriateness estimation method in Table 4. The macro-F1 results show that appropriateness estimation for the responses can feasibly be implemented using the BERT model. The accuracy of the BERT model is 0.758. Therefore, in testing the question difficulty decision method [ours (noisy)], the correct response appropriateness labels were input at a rate of 0.758.
Table 5 shows the results of the evaluation of the question difficulty decision method. It shows that the proposed method [ours (noisy)] outperformed [random] and [vanilla BERT], in which the appropriateness information was not presented. The differences between [ours (noisy)] and the other methods were evaluated using McNemar's test at p < 0.05; the difference from the random prediction method was statistically significant, while the differences from the remaining methods were not. Table 6 shows an example where [ours (noisy)] was correct, but [vanilla BERT] was not. In this example, [ours (noisy)] correctly predicted that the question generated after R_55 would be difficult by utilizing the information that R_55 was an appropriate response.

The F1 score of the proposed method [ours (noisy)] shows a slight decrease relative to [ours (correct)]. This result is thought to be due to errors in the response appropriateness estimation model. The performance of [ours (noisy) w/o seg] is degraded relative to [ours (noisy)], indicating that the distinct segment embeddings are effective for incorporating the response appropriateness labels into the input sequences.
Table 6: Example of a sample where [vanilla BERT] provides an incorrect difficulty decision, but [ours (noisy)] is correct. The appropriateness of R_55 shows the result of the response appropriateness estimation, and this estimation result was correct. The difficulty column shows the true labels in the data.

utterance | appropriateness | difficulty | [vanilla BERT] | [ours (noisy)]
Q_54: So you send messages with some characters that can be pronounced like alphabets? | - | easy | - | -
R_54: You mean characters that can be pronounced. | - | - | - | -
Q_55: Yes, do you use the Russian alphabets to write Mongolian texts, like writing in the English alphabets? | - | easy | - | -
R_55: It's similar to Russian alphabets, but not identical. | appropriate | - | - | -
Q_56: So what's the difference between Mongolian alphabets and Russian alphabets? | - | difficult | easy | difficult
5 CONCLUSIONS
We proposed a two-step method for question diffi-
culty decision for automated interview test delivery
using the large-scale language model BERT, includ-
ing response appropriateness estimation. The exper-
imental results show that the judgments of appropri-
ateness estimation were useful when deciding the dif-
ficulty of the subsequent question.
This method only uses two pairs of questions and
responses in the dialogue context as input. Long-term
contexts containing temporal information could play
an important role in dialogue-related tasks, and they
should not be simply discarded. Thus, we intend to
explore this further in later work. We will also ex-
amine different means of incorporating additional in-
formation that could be beneficial for interview test
automation. For example, tester strategy may differ
between the early and late stages of an interview. The
early stage focuses on probing, and later stage fo-
cus on level checking. Therefore, the amount of time
spent on the current interview will have an impact on
future difficulty decisions.
REFERENCES
ACTFL (2012). Oral proficiency interview familiarization
manual. https://community.actfl.org.
Bahari, A. (2021). Computer-assisted language proficiency
assessment tools and strategies. Open Learning: The
Journal of Open, Distance and e-Learning, 36(1):61–
87.
Bernstein, J., Van Moere, A., and Cheng, J. (2010). Val-
idating automated speaking tests. Language Testing,
27(3):355–377.
de Wet, F., Van der Walt, C., and Niesler, T. (2009). Au-
tomatic assessment of oral language proficiency and
listening comprehension. Speech Communication,
51(10):864–874.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. Association for Computational
Linguistics.
Du, X., Shao, J., and Cardie, C. (2017). Learning to ask:
Neural question generation for reading comprehen-
sion. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 1342–1352, Vancouver, Canada.
Association for Computational Linguistics.
Huang, Y.-T., Tseng, Y.-M., Sun, Y. S., and Chen, M. C.
(2014). Tedquiz: automatic quiz generation for
ted talks video clips to assess listening comprehen-
sion. In 2014 IEEE 14th International Conference
on Advanced Learning Technologies, pages 350–354,
Athens, Greece. IEEE.
Kasper, G. (2006). When once is not enough: Politeness of
multiple requests in oral proficiency interviews. Mul-
tilingua, 25(3):323–350.
Litman, D., Young, S., Gales, M., Knill, K., Ottewell, K.,
Van Dalen, R., and Vandyke, D. (2016). Towards us-
ing conversations with spoken dialogue systems in the
automated assessment of non-native speakers of En-
glish. In Proceedings of the 17th Annual Meeting
of the Special Interest Group on Discourse and Di-
alogue, pages 270–275, Los Angeles.
Longman, P. (2012). Official Guide to Pearson Test of En-
glish Academic. Pearson Japan, London, 2nd edition.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight de-
cay regularization. In Proceedings of the 7th Interna-
tional Conference on Learning Representations, New
Orleans, LA, USA.
Mueller, J. and Thyagarajan, A. (2016). Siamese recurrent
architectures for learning sentence similarity. In Pro-
ceedings of the 30th AAAI Conference on Artificial In-
telligence, page 2786–2792. AAAI Press.
Powers, D. E. (2010). The case for a comprehensive, four-
skills assessment of English-language proficiency. R
& D Connections, 14:1–12.
Song, H., Wang, Y., Zhang, K., Zhang, W.-N., and Liu,
T. (2021). BoB: BERT over BERT for training
persona-based dialogue models from limited person-
alized data. In Proceedings of the 59th Annual Meet-
ing of the Association for Computational Linguistics
and the 11th International Joint Conference on Natu-
ral Language Processing (Volume 1: Long Papers),
pages 167–177, Online. Association for Computa-
tional Linguistics.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wang, P., Fang, J., and Reinspach, J. (2021). CS-BERT:
a pretrained model for customer service dialogues.
In Proceedings of the 3rd Workshop on Natural Lan-
guage Processing for Conversational AI, pages 130–
142, Online. Association for Computational Linguis-
tics.
Wu, C.-S., Hoi, S. C., Socher, R., and Xiong, C. (2020).
TOD-BERT: Pre-trained natural language understand-
ing for task-oriented dialogue. In Proceedings of the
2020 Conference on Empirical Methods in Natural
Language Processing, pages 917–929, Online. Asso-
ciation for Computational Linguistics.
Xie, Y., Xing, L., Peng, W., and Hu, Y. (2021). IIE-
NLP-eyas at SemEval-2021 task 4: Enhancing PLM
for ReCAM with special tokens, re-ranking, Siamese
encoders and back translation. In Proceedings of
the 15th International Workshop on Semantic Evalua-
tion, pages 199–204, Online. Association for Compu-
tational Linguistics.
Xiong, Y., Feng, Y., Wu, H., Kamigaito, H., and Okumura,
M. (2021). Fusing label embedding into BERT: An
efficient improvement for text classification. In Find-
ings of the Association for Computational Linguistics:
ACL-IJCNLP 2021, pages 1743–1750, Online. Asso-
ciation for Computational Linguistics.
Yannakoudakis, H., Briscoe, T., and Medlock, B. (2011).
A new dataset and method for automatically grading
ESOL texts. In Proceedings of the 49th Annual Meet-
ing of the Association for Computational Linguistics:
Human Language Technologies, pages 180–189, Port-
land, Oregon, USA. Association for Computational
Linguistics.
Yoon, S.-Y. and Lee, C. M. (2019). Content modeling for
automated oral proficiency scoring system. In Pro-
ceedings of the Fourteenth Workshop on Innovative
Use of NLP for Building Educational Applications,
pages 394–401, Florence, Italy. Association for Com-
putational Linguistics.
Zechner, K., Evanini, K., Yoon, S.-Y., Davis, L., Wang, X.,
Chen, L., Lee, C. M., and Leong, C. W. (2014). Au-
tomated scoring of speaking items in an assessment
for teachers of English as a foreign language. In Pro-
ceedings of the Ninth Workshop on Innovative Use
of NLP for Building Educational Applications, pages
134–142, Baltimore, Maryland. Association for Com-
putational Linguistics.