An Efficient Approach based on BERT and Recurrent Neural Network for Multi-turn Spoken Dialogue Understanding
Weixing Xiong*a, Li Ma*b and Hongtao Liao
Ubtech Robotics Corp, Nanshan I Park, No.1001 Xueyuan Road, Shenzhen, China
Keywords: BERT, Recurrent Neural Network, Multi-turn, Spoken Dialogue Understanding.
Abstract: The main challenge of Spoken Language Understanding (SLU) is how to efficiently parse natural language into effective meanings, such as topic intents, acts and slot-value pairs, that can be processed by computers. In multi-turn dialogues, combining context information is necessary to understand the user's objectives and to avoid ambiguity. This paper proposes an approach for processing multi-turn dialogues based on the combination of BERT encoding and a hierarchical RNN. More specifically, it combines the current user's utterance with each historical utterance to form an input to the BERT module, which extracts the semantic relationship; it then uses a model derived from the hierarchical RNN to understand intents, acts and slots. In our experiments on the multi-turn dialogue datasets Sim-R and Sim-M, this approach achieved about 5% improvement in FrameAcc compared with models such as MemNet and SDEN.
1 INTRODUCTION
 
In a task-oriented dialogue, all the information needed for a specific task or purpose may not be given in a single user utterance. It is necessary to engage in a multi-turn dialogue to obtain the mandatory information. The main function of an SLU module is to efficiently extract, from the user's input, the intents, acts and slot-value pairs (Dilek Hakkani-Tur et al., 2016).
Figure 1: An example semantic frame with slot, intent and
dialogue act annotations, following the IOB tagging
scheme.
a: https://orcid.org/0000-0002-0929-096X
b: https://orcid.org/0000-0002-9297-4632
*: Equal contribution, ordered by random shuffle.
Figure 1 shows the semantic frame information for a task-oriented dialogue, in which the slot labels are represented in the standard IOB format.
In a real dialogue, the necessary information is specified along with the dialogue flow. Let us continue the above example:
S2: "How many people will attend the dinner?"
U3: "5."
Here, the user utterance "5" corresponds to the entity category "B-#people" in the context of the restaurant-booking task. The difficulty is how a computer system can analyze all this information to extract the intents, acts and slots.
Approaches such as those proposed by P. Xu and R. Sarikaya, 2013, B. Liu et al., 2016, or Zhang et al., 2016, jointly model intents and slots with an RNN, but they do not take the necessary context information into account.
MemNet, proposed by S. Sukhbaatar et al., 2015 and Chen et al., 2016, and SDEN, proposed by Ankur Bapna et al., 2017 and Raghav et al., 2018, can effectively take the historical context into account to understand semantic frames by encoding information as
Figure 2: MSDU model. The model contains two kinds of RNN models: RNN(BiGRU_u) encoding over utterance tokens, and RNN(BiGRU_g and BiGRU_s) encoding over utterances.
sentence vectors with a GRU. However, in MemNet and SDEN, each sentence needs to be encoded as a single vector, which leads to the loss of lexical-granularity information when analysing the relationship between the current sentence and the historical inputs.
In this paper, we propose a model based on BERT¹ (Jacob Devlin et al., 2018) and a hierarchical RNN. We encode the current user's utterance together with each historical utterance successively as the input of the BERT module, and then, to encode the memory from context, we use a modified hierarchical RNN to process the outputs of the BERT module.
There are three main aspects to our contribution. Firstly, by using the BERT model, the attention of adjacent words is introduced into the word and sentence embeddings. Secondly, by concatenating the current utterance with each historical utterance, the model can compute attention across other turns of the dialogue when performing BERT encoding. Thirdly, by using a modified hierarchical RNN to process the outputs of the BERT module, information from the context can be encoded more effectively.
The remainder of this paper is organized as follows: Section 2 describes the general architecture of our model; Section 3 lists and analyzes the experimental results; the last section presents the conclusions and discussion.
1: The open source BERT implementation based on PyTorch is available at https://github.com/huggingface/pytorch-pretrained-BERT. Note that the pre-trained BERT has two versions. In our experiment, we use the base version.
2 MSDU MODEL
We abbreviate our new model as MSDU, an acronym for "Multi-turn Spoken Dialogue Understanding". The proposed model is dedicated to multi-turn spoken dialogue understanding and intent information extraction; its overall structure is shown in Figure 2. We divide a sequence of dialogues into n turns, each of them containing the user's utterance $u_n$ and the system's responding utterance $s_n$. The user's current (the last input) utterance $u_n$ can be represented by formula (1).
$$u_n = \{w_n^1, w_n^2, w_n^3, \dots, w_n^{\mathrm{len}(u_n)-1}, w_n^{\mathrm{len}(u_n)}\} \quad (1)$$
where $w_n^i$ represents the i-th word token in the utterance and $\mathrm{len}(u_n)$ denotes the number of tokens in $u_n$.
So, there are n-1 user utterances and n-1 system replies in the historical dialogue, represented as formula (2).

$$D = \{u_1, s_1, u_2, s_2, \dots, u_{n-1}, s_{n-1}\} \quad (2)$$
In multi-turn dialogues, the key issue is how to effectively use the context information in the conversation to track the current state. In order to obtain the information relating the current user utterance to the historical utterances, we need to build a relationship between them. Therefore, we
have designed a concatenation method, detailed in Section 2.1.
2.1 Concatenation Method
We concatenate the current user utterance $u_n$ with each utterance in the historical turns $(u_1, s_1, u_2, s_2, \dots, u_{n-1}, s_{n-1})$ to form a new sentence-pair vector C. For instance, the concatenation of $(u_n, s_{n-1})$ is expressed as follows:
$$C_{s_{n-1}u_n} = \{[\mathrm{CLS}], w_{s_{n-1}}^1, \dots, w_{s_{n-1}}^{\mathrm{len}(s_{n-1})}, [\mathrm{SEP}], w_{u_n}^1, \dots, w_{u_n}^{\mathrm{len}(u_n)}, [\mathrm{SEP}], [\mathrm{PAD}], \dots, [\mathrm{PAD}]\} \quad (3)$$

$$B_{s_{n-1}u_n} = \{\underbrace{0, \dots, 0}_{\mathrm{len}(s_{n-1})+2}, \underbrace{1, \dots, 1}_{\mathrm{len}(u_n)}, 0, \dots, 0\} \quad (4)$$

$$C_{u_1 u_n} = \{[\mathrm{CLS}], w_{u_1}^1, \dots, w_{u_1}^{\mathrm{len}(u_1)}, [\mathrm{SEP}], w_{u_n}^1, \dots, w_{u_n}^{\mathrm{len}(u_n)}, [\mathrm{SEP}], [\mathrm{PAD}], \dots, [\mathrm{PAD}]\} \quad (5)$$

$$B_{u_1 u_n} = \{\underbrace{0, \dots, 0}_{\mathrm{len}(u_1)+2}, \underbrace{1, \dots, 1}_{\mathrm{len}(u_n)}, 0, \dots, 0\} \quad (6)$$
where $w_u^i$ indicates the i-th token in utterance u, and $\mathrm{len}(u_n)$ and $\mathrm{len}(s_{n-1})$ denote respectively the number of tokens in utterances $u_n$ and $s_{n-1}$. [CLS], [SEP] and [PAD] are special tags in BERT inputs: [CLS] marks the beginning of a sequence, [SEP] is a separator between two sequences, and [PAD] is used to pad all sequences to the same length.
In order to facilitate the computation in the model, we pad each sequence to a fixed length, set to 64 in our experiments. Then we generate a Boolean vector indicating which words in C come from the current user utterance. The generated Boolean vector B is shown in formulas (4) and (6). In our example, we obtain 64 - len($u_n$) zeros and len($u_n$) ones.
In real application scenarios, the user's current utterance may be a response not only to the last system utterance, but also to an earlier system utterance, or a supplement to a previous user utterance. Therefore, in order to achieve a better modeling effect, we need to take all historical utterances from $u_1$ to $s_{n-1}$ into account, rather than just concatenating $u_n$ with the last system utterance $s_{n-1}$. This is done as in formulas (5) and (6).
In this way, the results shown in formula (7) can be obtained one by one, giving 2*(n-1) pairs.

$$\mathrm{inputs} = \underbrace{\{(C_{u_1 u_n}, B_{u_1 u_n}), \dots, (C_{s_{n-1} u_n}, B_{s_{n-1} u_n})\}}_{2 \times (n-1)} \quad (7)$$
We then use two types of RNN to further process the concatenated utterance pairs. The first, called the tokens-level RNN, encodes the relationship between words within a single utterance pair; the second, called the utterances-level RNN, integrates information across all utterance pairs. A sketch of the concatenation procedure is given below.
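To make the concatenation concrete, the following is a minimal pure-Python sketch of our reading of formulas (3)-(7); the function names and the toy restaurant example are illustrative, and utterances are assumed to be already tokenized.

```python
MAX_LEN = 64  # fixed sequence length used in our experiments

def concat_pair(history_utt, current_utt, max_len=MAX_LEN):
    # C: [CLS] history [SEP] current [SEP] [PAD]...
    # B: 1 at current-utterance positions, 0 elsewhere, as in formulas (4), (6)
    c = ["[CLS]"] + history_utt + ["[SEP]"] + current_utt + ["[SEP]"]
    b = [0] * (len(history_utt) + 2) + [1] * len(current_utt) + [0]
    pad = max_len - len(c)
    c += ["[PAD]"] * pad
    b += [0] * pad
    return c[:max_len], b[:max_len]

def build_inputs(history, current_utt):
    # history = [u_1, s_1, ..., u_{n-1}, s_{n-1}] -> 2*(n-1) (C, B) pairs, formula (7)
    return [concat_pair(h, current_utt) for h in history]

history = [["book", "a", "table"],
           ["how", "many", "people", "will", "attend", "the", "dinner", "?"]]
c, b = build_inputs(history, ["5"])[1]
print(c[:12])  # ['[CLS]', 'how', ..., '?', '[SEP]', '5', '[SEP]']
print(b[:12])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
```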
2.2 RNN over Utterance Tokens
In order to capture the relationship between word tokens in the dialogue utterances, we feed the pre-trained BERT model with the concatenated pairs (C, B). The outputs are represented by H in formula (8), consisting of k vectors of dimension 64*768, where k = 2*(n-1) is the number of historical utterance pairs, 64 is the sequence length after padding, and 768 is the default hidden-layer size of BERT's outputs.
$$\begin{aligned}
H_1 &= \mathrm{BERT}(C_{u_1 u_n}, B_{u_1 u_n}) \\
H_2 &= \mathrm{BERT}(C_{s_1 u_n}, B_{s_1 u_n}) \\
&\vdots \\
H_k &= \mathrm{BERT}(C_{s_{n-1} u_n}, B_{s_{n-1} u_n})
\end{aligned} \quad (8)$$
We use the BASE version of pre-trained BERT in our tests, and the parameters of the BERT model are kept fixed during training for the sake of computing speed. The outputs of the BERT model are fed into the tokens-level RNN, a BiGRU model, for fine tuning. The specific encoding results are shown in formula (9):
$$\begin{aligned}
(o_{1f}^1, o_{1f}^2, \dots, o_{1f}^l, o_{1b}^1, o_{1b}^2, \dots, o_{1b}^l) &= \mathrm{BiGRU_u}(H_1) \\
(o_{2f}^1, o_{2f}^2, \dots, o_{2f}^l, o_{2b}^1, o_{2b}^2, \dots, o_{2b}^l) &= \mathrm{BiGRU_u}(H_2) \\
&\vdots \\
(o_{kf}^1, o_{kf}^2, \dots, o_{kf}^l, o_{kb}^1, o_{kb}^2, \dots, o_{kb}^l) &= \mathrm{BiGRU_u}(H_k)
\end{aligned} \quad (9)$$
In the above equations, each o with subscript f is a hidden-layer result of the BiGRU forward pass, and each o with subscript b is a result of the BiGRU backward pass. In our experiments, the size of o is set to 64, and l, the input sequence length after padding, is also set to 64.
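As a rough illustration of formulas (8)-(9), the following PyTorch sketch runs the tokens-level BiGRU over frozen BERT encodings; the class name TokensLevelRNN and the random tensor standing in for the BERT outputs are assumptions of ours, not the paper's code.

```python
import torch
import torch.nn as nn

class TokensLevelRNN(nn.Module):
    def __init__(self, bert_dim=768, hidden=64):
        super().__init__()
        # BiGRU_u: forward and backward hidden states of size 64 each
        self.bigru = nn.GRU(bert_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, H):
        # H: (k, 64, 768) frozen BERT outputs for the k utterance pairs
        out, _ = self.bigru(H)   # (k, 64, 128): [o_f^i ; o_b^i] per token
        return out

H = torch.randn(6, 64, 768)       # stand-in for BERT(C, B) outputs, k = 6
print(TokensLevelRNN()(H).shape)  # torch.Size([6, 64, 128])
```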
2.3 RNN over Utterance Context
When parsing the concatenated utterance pairs, we consider the following formula (10):
$$\begin{aligned}
(h_1^1, \dots, h_1^{\mathrm{len}(u_n)}) &= \hat{f}(B_{u_1 u_n} \times [o_{1f}^1; o_{1b}^1, \dots, o_{1f}^l; o_{1b}^l]) \\
(h_2^1, \dots, h_2^{\mathrm{len}(u_n)}) &= \hat{f}(B_{s_1 u_n} \times [o_{2f}^1; o_{2b}^1, \dots, o_{2f}^l; o_{2b}^l]) \\
&\vdots \\
(h_k^1, \dots, h_k^{\mathrm{len}(u_n)}) &= \hat{f}(B_{s_{n-1} u_n} \times [o_{kf}^1; o_{kb}^1, \dots, o_{kf}^l; o_{kb}^l])
\end{aligned} \quad (10)$$
Where
and are diagonal matrix gener-
ated from Boolean vectors built from formula (6),
those diagonal matrix have size 64*64.
is the con-
catenate operator, the size of o is set to 64, and the
size of those concatenated vectors will be 128. Vec-
tors wrapped with bracket make up a matrix, the ele-
ments in the bracket represent the row vectors that
make up the matrix. In our experiments, the concate-
nated matrix size is 64*128.
is the selection opera-
tor used to fetch out all none-zero row vectors from a
matrix. It is easy to deduce that the number of non-0
row vectors of each matrix is equal to the number of
non-0 elements on the diagonal line of
, which
equals to len(u
n
), the number of tokens in the utter-
ance u
n
, so the fetching operation finally allow us to
obtain len(u
n
) row vectors. Each of these vectors rep-
resents the embedding of a special word in the current
user utterance, with the attention information.
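The masking-and-selection step of formula (10) can be pictured in a few lines of PyTorch; select_current_tokens is an illustrative name of ours, and the example Boolean vector is made up.

```python
import torch

def select_current_tokens(token_states, b):
    # diag(b) zeroes out the history rows; f-hat then keeps the non-zero
    # rows, i.e. the len(u_n) embeddings of the current user utterance
    masked = torch.diag(b.float()) @ token_states   # (64, 128)
    return masked[b.bool()]                         # (len(u_n), 128)

states = torch.randn(64, 128)   # [o_f ; o_b] row vectors from formula (9)
b = torch.zeros(64)
b[10:15] = 1                    # 5 current-utterance positions
print(select_current_tokens(states, b).shape)  # torch.Size([5, 128])
```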
The outputs of formula (10) are then used as the inputs of the utterances-level RNN to encode the slot tag of each token in the user utterance.
The prediction of intents and acts needs different context information from that of slots, so the utterances-level RNN used for predicting intents and acts does not share parameters with the one used for slots. For each row in formula (9), the last hidden layer in both directions is concatenated to encode information for the whole utterance pair. In this way, we concatenate the final forward and backward states $(o_f^l, o_b^l)$ to obtain vectors $O_1, \dots, O_k$ for full-text understanding, as shown in formula (11).
$$\begin{aligned}
O_1 &= [o_{1f}^l; o_{1b}^l] \\
O_2 &= [o_{2f}^l; o_{2b}^l] \\
&\vdots \\
O_k &= [o_{kf}^l; o_{kb}^l]
\end{aligned} \quad (11)$$
Then we take all vectors from $O_1$ to $O_k$ as inputs to the utterances-level RNN model, and take its last hidden layer as the context embedding, expressed in formula (12).

$$G = \mathrm{GRU_g}(O_1, O_2, \dots, O_{k-1}, O_k) \quad (12)$$
Note that in formula (12), the output G is the last hidden layer of $\mathrm{GRU_g}$; it drives the classification of intents and acts. For the prediction of slots, the selected attention distribution is used as input to the utterances-level RNN, as given by formula (13).
$$S_j = \mathrm{GRU_s}(h_1^j, h_2^j, \dots, h_{k-1}^j, h_k^j) \quad (13)$$
In formula (13), $S_j$ is the last hidden layer of $\mathrm{GRU_s}$; its output provides the information for the slot prediction corresponding to the j-th token in the current user utterance, where j satisfies $1 \le j \le \mathrm{len}(u_n)$.
Thus, the named-entity information of each word in the current utterance can be obtained, with the attention information taking the dialogue context into account, as shown in formula (14):

$$S = \{S_1, S_2, \dots, S_{\mathrm{len}(u_n)-1}, S_{\mathrm{len}(u_n)}\} \quad (14)$$
The G and S obtained above are used for the determination of intents, acts and slots, computed as in formulas (15), (16) and (17), the same as used by MemNet and SDEN.

$$P_{\mathrm{Intent}} = \mathrm{Softmax}(UG) \quad (15)$$

$$P_{\mathrm{Act}} = \mathrm{Sigmoid}(VG) \quad (16)$$
Inspired by memory networks and SDEN, we take the value of S as the input of another BiLSTM model and the value of G as its initial hidden layer h(0). We then feed the result of its output layer into a Softmax layer to get the named-entity prediction of each word in the current utterance.

$$P_i^{\mathrm{Slot}} = \mathrm{Softmax}(\mathrm{BiLSTM}(S|G)_i) \quad (17)$$
In formula (17), $P_i^{\mathrm{Slot}}$ denotes the estimated probability vector of the i-th word in the current utterance; each element of the vector represents the probability that the i-th word belongs to the corresponding entity category. In the expression $\mathrm{BiLSTM}(S|G)_i$, i selects the i-th hidden layer of the BiLSTM as output, and the parameter G, coming from formula (12), is the initial hidden layer of the BiLSTM.
We compute a cross-entropy loss for each sub-task, take their sum as the total loss, and optimize our model on this total loss, as sketched below.
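A condensed PyTorch sketch of how the utterances-level RNNs and the three output heads of formulas (11)-(17) could fit together; class and parameter names (UtterancesLevelHeads, n_intents, etc.) are our own, and the label counts are placeholders rather than the paper's. The two separate GRU modules mirror the non-shared parameters for the intent/act head and the slot head.

```python
import torch
import torch.nn as nn

class UtterancesLevelHeads(nn.Module):
    def __init__(self, dim=128, n_intents=3, n_acts=8, n_slot_tags=15):
        super().__init__()
        self.gru_g = nn.GRU(dim, dim, batch_first=True)  # GRU_g, formula (12)
        self.gru_s = nn.GRU(dim, dim, batch_first=True)  # GRU_s, formula (13)
        self.intent_head = nn.Linear(dim, n_intents)     # U in formula (15)
        self.act_head = nn.Linear(dim, n_acts)           # V in formula (16)
        self.slot_lstm = nn.LSTM(dim, dim, batch_first=True,
                                 bidirectional=True)     # BiLSTM, formula (17)
        self.slot_head = nn.Linear(2 * dim, n_slot_tags)

    def forward(self, O, h):
        # O: (1, k, dim) pair summaries O_1..O_k from formula (11)
        # h: (len_u, k, dim) per-token states h_m^j selected by formula (10)
        _, G = self.gru_g(O)                 # (1, 1, dim), last hidden layer
        _, S = self.gru_s(h)                 # (1, len_u, dim), one S_j per token
        intents = torch.softmax(self.intent_head(G[0]), dim=-1)  # formula (15)
        acts = torch.sigmoid(self.act_head(G[0]))                # formula (16)
        h0 = G.repeat(2, 1, 1)               # G as initial state, both directions
        out, _ = self.slot_lstm(S, (h0, torch.zeros_like(h0)))
        slots = torch.softmax(self.slot_head(out), dim=-1)       # formula (17)
        return intents, acts, slots

k, len_u = 6, 5
heads = UtterancesLevelHeads()
i, a, s = heads(torch.randn(1, k, 128), torch.randn(len_u, k, 128))
print(i.shape, a.shape, s.shape)  # (1, 3) (1, 8) (1, 5, 15)
```

In training, one cross-entropy loss per sub-task would be computed on these outputs and summed into the total loss, as described above.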
3 EXPERIMENTS
In order to verify the effectiveness of the model, we use the datasets Sim-R and Sim-M for training and testing, the same as those used by MemNet and SDEN.
3.1 Dataset
The datasets Sim-R and Sim-M (Shah P et al., 2018) are widely used in context-based joint recognition tasks for intents, acts and slots.² Sim-R is a multi-turn conversation dataset for the restaurant domain; its training set contains 1116 dialogues and 11234 interactions. Sim-M is a multi-turn conversation dataset for the movie domain; its training set contains 384 dialogues and 3562 interactions. Table 1 gives an overview of these datasets.
We combine the two training sets into one for training, and then test the model on both the individual test sets and the combined test set. In some cases, user utterances appear at the beginning of the dialogue without act labels; the corresponding act label is then set to "OTHER".
3.2 Baselines
As benchmarks, we use a 2-layer RNN model that does not consider context information, as well as MemNet and SDEN.
In order to study the effectiveness of each component of the proposed MSDU model, we performed specific ablation tests. In the first case, we removed both the BERT module and the concatenation process. In the second case, the BERT module was replaced by randomly initialized word embeddings. In the third case, we did not concatenate the sentences explained above for the BERT module. Finally, we performed the concatenation of sentences with the BERT module. We also tested a CRF module on top of MSDU for slot recognition. In the following, we give a more detailed explanation of the different models used for comparison.
NoContext: Regardless of dialogue information from context, the model's structure is the same as the current-utterance processing module in MemNet and SDEN. A two-layer RNN structure consisting of one GRU and one LSTM is adopted. The difference is that the initial hidden layer of the LSTM is an all-zero vector, so it does not contain any information about context.
2: This dataset can be downloaded from http://github.com/google-research-datasets/simulated-dialogue
3: Available at https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
MemNet: The attention scores between the current sentence vector and the historical sentence vectors are calculated based on cosine similarity; the context vector is the sum of the historical sentence vectors weighted by their attention scores. The current-sentence processing module is the same as in NoContext, and the word embeddings are randomly initialized.
MemNet+FastText: The MemNet model with its word-embedding matrix initialized with 300-dimensional pre-trained word embeddings from FastText³ (E. Grave, P. Bojanowski et al., 2018).
SDEN: A modification of MemNet with randomly initialized word embeddings. It calculates the attention between the current utterance vector and the historical ones using a fully connected linear layer, and feeds the attention vectors into a GRU model to obtain the context vector.
SDEN+FastText: The SDEN model with its word-embedding matrix initialized with 300-dimensional pre-trained FastText word vectors.
MSDU-BERT-Concat: This MSDU variant uses neither the BERT module nor the concatenation process; it uses a hierarchical GRU to encode the context memory.
MSDU-BERT: In this case, a randomly initialized 300-dimensional word-embedding matrix is used instead of the BERT module.
MSDU-Concat: No dialogue-utterance concatenation is used with the MSDU model; instead, we embed each historical utterance into a 300-dimensional vector using BERT-GRU. All these vectors are then fed into a GRU to produce a single 300-dimensional vector characterizing the context memory. This vector is further used as the first hidden state of the GRU when processing the current utterance.
MSDU+CRF: MSDU with a CRF module; in the prediction step, we use the Viterbi algorithm to find the tag sequence with the highest probability, as sketched after this list.
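Since the MSDU+CRF item above refers to Viterbi decoding, here is a generic sketch of that algorithm; the emission and transition score tensors are hypothetical stand-ins, as the paper does not detail its CRF implementation.

```python
import torch

def viterbi_decode(emissions, transitions):
    # emissions: (len_u, n_tags) per-token tag scores from the model
    # transitions: (n_tags, n_tags) score of moving from tag i to tag j
    length, n_tags = emissions.shape
    score = emissions[0].clone()          # best path score ending in each tag
    backpointers = []
    for t in range(1, length):
        # total[i, j]: best score of a path ending at tag j via previous tag i
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, idx = total.max(dim=0)     # maximize over the previous tag i
        backpointers.append(idx)
    best = [int(score.argmax())]          # best final tag
    for idx in reversed(backpointers):    # walk the pointers backwards
        best.append(int(idx[best[-1]]))
    return best[::-1]

tags = viterbi_decode(torch.randn(5, 7), torch.randn(7, 7))
print(tags)  # e.g. [3, 0, 6, 2, 5]
```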
3.3 Training and Evaluation
The hyperparameters of all models are set as follows:
Batch_size: 64
Dropout ratio: 0.3
Word embed size: 64
Hidden size for sent encoding: 64
Table 1: Profile of the datasets used in the experiments, with their intents, acts, slots, and numbers of dialogues (a dialogue may include many turns of interaction between user and system).

Dataset | Intents | Acts | Slots | No.Train | No.Dev | No.Test
Sim-R | FIND_RESTAURANT, RESERVE_RESTAURANT | THANK_YOU, INFORM, AFFIRM, CANT_UNDERSTAND, REQUEST_ALTS, NEGATE, GOOD_BYE, OTHER | price_range, location, restaurant_name, category, num_people, date, time | 1116 | 349 | 775
Sim-M | BUY_MOVIE_TICKETS | OTHER, GREETING, GOOD_BYE, CANT_UNDERSTAND, THANK_YOU, NEGATE, AFFIRM, INFORM | theatre_name, movie, date, time, num_people | 384 | 120 | 264
Table 2: SLU results on the test sets for the baselines and MSDU, when trained on Sim-M + Sim-R; "Overall" means the test set is Sim-R + Sim-M. Because any of the above models can add a CRF module, we did not consider MSDU+CRF when marking the maximum values.

Model | Intent F1 (Sim-R / Sim-M / Overall) | Act F1 (Sim-R / Sim-M / Overall) | Slot F1 (Sim-R / Sim-M / Overall) | FrameAcc (Sim-R / Sim-M / Overall)
NoContext | 82.04 / 68.47 / 78.33 | 88.37 / 88.74 / 88.48 | 97.56 / 94.70 / 96.64 | 71.13 / 45.89 / 63.96
MemNet | 99.82 / 98.39 / 99.44 | 94.97 / 89.35 / 93.38 | 97.64 / 94.00 / 96.56 | 86.90 / 65.90 / 80.88
MemNet+FastText | 99.50 / 99.63 / 99.56 | 91.80 / 89.56 / 91.18 | 97.39 / 94.44 / 96.55 | 83.76 / 67.23 / 79.13
SDEN | 98.08 / 98.75 / 98.35 | 92.66 / 87.50 / 91.16 | 97.59 / 94.21 / 96.59 | 85.36 / 65.76 / 79.29
SDEN+FastText | 99.71 / 99.85 / 99.73 | 89.26 / 90.65 / 89.61 | 97.39 / 94.78 / 96.59 | 83.56 / 69.06 / 79.44
MSDU-BERT-Concat | 99.88 / 99.93 / 99.90 | 96.57 / 92.16 / 95.27 | 96.55 / 93.86 / 96.55 | 88.10 / 67.67 / 82.29
MSDU-BERT | 99.80 / 99.93 / 99.85 | 96.40 / 90.58 / 94.75 | 97.68 / 95.84 / 97.13 | 87.25 / 73.53 / 83.40
MSDU-Concat | 99.50 / 99.93 / 99.62 | 96.89 / 91.82 / 95.39 | 98.20 / 97.22 / 97.90 | 88.32 / 76.76 / 85.02
MSDU | 99.85 / 99.93 / 99.88 | 96.94 / 92.16 / 95.56 | 98.01 / 97.33 / 97.81 | 88.68 / 78.30 / 85.73
MSDU+CRF | 99.88 / 99.93 / 99.90 | 97.20 / 92.30 / 95.76 | 98.00 / 97.09 / 97.75 | 89.26 / 78.30 / 86.19
For all models, we use the same Adam optimizer. The initial learning rate is set to 0.001, decreased to 0.0001 after the 125th epoch and to 0.00001 after the 250th epoch, with betas=(0.9, 0.999) and eps=1e-8. The results are evaluated on the validation set after every epoch. The model is saved whenever the FrameAcc breaks the historical record, and the training process is terminated after 500 epochs. We used the last saved model for testing.
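A small sketch of this training schedule in PyTorch, assuming a model instance; the function name is ours, and only the learning-rate milestones and Adam settings come from the text.

```python
import torch

def make_optimizer_and_scheduler(model):
    opt = torch.optim.Adam(model.parameters(), lr=0.001,
                           betas=(0.9, 0.999), eps=1e-8)
    # 0.001 -> 0.0001 after the 125th epoch -> 0.00001 after the 250th
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[125, 250],
                                                 gamma=0.1)
    return opt, sched  # call sched.step() once per epoch, up to 500 epochs
```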
Figure 3 shows how the main performance rates of MSDU evolve with the training epochs. The F1-score of intent recognition reaches a high level even at the first epoch, and the performance curves reach their highest level by about the 10th epoch.
Figure 3: Main performance measures on the evaluation set (Intent_F1, Act_F1, Slot_F1, Frame_Acc) as they change with the training epochs.
3.4 Results and Analysis
Table 2 shows the results of each model for intent, act and slot recognition, respectively. The FrameAcc column shows the proportion of turns for which each model recognized intents, acts and slots all correctly in the test experiment. The second row lists the test sets used; "Overall" represents the new test set combining Sim-R and Sim-M.
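Our reading of the FrameAcc metric (a frame counts only if intent, acts and all slot tags are simultaneously correct) can be sketched as follows; the dict-based frame representation is illustrative, not the paper's evaluation code.

```python
def frame_acc(preds, golds):
    # a turn counts as correct only when intent, act set and slot tags all match
    correct = sum(p["intent"] == g["intent"]
                  and set(p["acts"]) == set(g["acts"])
                  and p["slots"] == g["slots"]
                  for p, g in zip(preds, golds))
    return correct / len(golds)

gold = [{"intent": "RESERVE_RESTAURANT", "acts": ["INFORM"],
         "slots": ["O", "B-#people"]}]
print(frame_acc(gold, gold))  # 1.0
```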
For MemNet and SDEN, we find that the models using randomly initialized word embeddings give better performance on the Sim-R dataset, which has the larger sample size. However, on the Sim-M dataset, with the smaller sample size, the models with pre-trained word embeddings are more satisfactory.
For intent recognition, the NoContext model is significantly worse than all other models. This indicates that the task of intent recognition is strongly dependent on context. Thanks to the introduction of contextual information, all other models obtain high accuracy in intent recognition, and the MSDU model clearly achieves the best results.
For the task of act recognition, the performance of NoContext is again lower than that of the other models, confirming that information from context is helpful. The performance of MSDU in act recognition is clearly better than that of the other models, which means MSDU has a stronger ability to understand the relationship between the context and the current user utterance.
For slot tagging, there is no significant performance difference between MemNet, SDEN and NoContext. On the other hand, MSDU and its variant models achieve better results. At the same time, we also find that MSDU-Concat performs nearly the same as MSDU for slot recognition, meaning that the concatenation process contributes little to slot recognition.
From the test results, we find that the MSDU model achieves about 5% higher FrameAcc than the MemNet and SDEN models.
It is interesting to notice that SDEN does not obtain better results than MemNet, even though the former uses a more complex context-encoding method. MSDU-BERT-Concat and the above two models use the same randomly initialized word-embedding method; the difference lies mainly in the context encoding. The MSDU-BERT-Concat model uses a hierarchical GRU to encode context information, which is even simpler than the context-encoding method used by MemNet; nevertheless, it obtains about 2% higher FrameAcc than MemNet and SDEN. This casts doubt on the necessity of the attention mechanism in context encoding.
From the results produced by the MSDU variant models, we can also conclude that the concatenation procedure brings about 1.1% of improvement, the BERT module brings about 2.7%, and the combination of both gives about 3.4% of improvement.
4 CONCLUSIONS AND FUTURE WORKS
The MSDU model is proposed for the recognition of intents, acts and slots using the historical information in a multi-turn spoken dialogue, and has been evaluated through training on different datasets and with several variant modifications. The test results show that the design concept of the MSDU model is effective and brings an important improvement.
For future work, we will study how to apply this new model architecture to higher-level dialogue understanding tasks, such as ontology-based slot recognition and the alignment of intent-act-slot. For the moment, we have not discussed the subordinate relationships among intents, acts and slots, which are essential to dialogue understanding.
REFERENCES
Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür and Larry Heck. 2017. Sequential Dialogue Context Modeling for Spoken Language Understanding. arXiv preprint arXiv:1705.03455.
B. Liu and I. Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.
A. Bordes, Y. L. Boureau, and J. Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of the 2017 International Conference on Learning Representations (ICLR).
Yun-Nung Chen, Dilek Hakkani-Tür, Gokhan Tür et al. 2016. End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding. In Proceedings of the 2016 Meeting of the International Speech Communication Association.
Dilek Hakkani-Tür, Gokhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM. In Proceedings of the 2016 Annual Conference of the International Speech Communication Association.
E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, et al. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the 2018 International Conference on Language Resources and Evaluation (LREC).
H. Zhou, M. Huang, and X. Zhu. 2016. Context-aware natural language generation for spoken dialogue systems. In Proceedings of the 2016 International Conference on Computational Linguistics (ICCL), pages 2032–2041.
Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
P. Xu and R. Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 78–83.
Raghav Gupta, Abhinav Rastogi, and Dilek Hakkani-Tür. 2018. An Efficient Approach to Encoding Context for Spoken Language Understanding. arXiv preprint arXiv:1807.00267.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-to-end Memory Networks. In Proceedings of the 2015 Conference on Neural Information Processing Systems (NIPS).
P. Shah, D. Hakkani-Tür, G. Tür, et al. 2018. Building a Conversational Agent Overnight with Dialogue Self-Play. arXiv preprint arXiv:1801.04871.
T.-H. Wen, M. Gasic, N. Mrksic, P.-H. Su, D. Vandyke, and S. Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1711–1721, Lisbon, Portugal. https://www.aclweb.org/anthology/D15-1199.
T.-H. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas Barahona, P.-H. Su, S. Ultes, and S. Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, pages 438–449, Valencia, Spain. https://doi.org/10.18653/v1/E17-1042.
Williams. 2013. Multi-domain learning and generalization in dialog state tracking. In Proceedings of the 2013 Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Association for Computational Linguistics, pages 433–441. https://www.aclweb.org/anthology/W13-4068.
X. Zhang and H. Wang. 2016. A joint model of intent determination and slot filling for spoken language understanding. In Proceedings of the 2016 International Joint Conference on Artificial Intelligence (IJCAI).