Is Your Chatbot Perplexing?: Confident Personalized Conversational
Agent for Consistent Chit-Chat Dialogue
Young Yun Na (1,a), Junekyu Park (2,b) and Kyung-Ah Sohn (2,3,c)
1 Department of Media, Ajou University, Suwon, Gyeonggi, Republic of Korea
2 Department of Artificial Intelligence, Ajou University, Suwon, Gyeonggi, Republic of Korea
3 Department of Software and Computer Engineering, Ajou University, Suwon, Gyeonggi, Republic of Korea
a https://orcid.org/0000-0001-7051-3766, b https://orcid.org/0000-0003-2984-9027, c https://orcid.org/0000-0001-8941-1188
These authors contributed equally. Corresponding author.
Keywords: Chatbot, Dialogue, Personalized, Confidence, Natural Language Processing.
Abstract:
Chatbots are being researched and employed not only in academic settings but also as applications in many other fields. Ultimately, conversational agents attempt to produce human-like responses across a dialogue. To achieve this goal, we built a novel framework that processes complex data consisting of personalities and utterances, and fine-tuned a large-scale self-attention-based language model. We propose a consistent personalized conversational agent (CPC-Agent) for this framework. Our model was designed to exploit the complex knowledge in the dataset to achieve accuracy and consistency. Together with a distractor mechanism, it generates confident responses. We compared our model to state-of-the-art models using automated metrics. Our model scored 3.29 in perplexity, 18.89 in F1 score, and 79.5 in Hits@1. In particular, the perplexity result was almost five times lower than that of the current state-of-the-art model, which scored 16.42. In addition, we conducted a human evaluation of each model to determine its response quality, because the automatic evaluation metrics in dialogue tasks are still considered insufficient. Our model achieved the best ratings from the voters, which indicates that it is adequate for practical use.
1 INTRODUCTION
The need for automatic response generation in machine-to-human interaction is increasingly being emphasized. However, few studies have achieved a sensible quality of auto-generated responses. Conventional chatbots, which are human-interactive response-generating machines, are built on tree structures that can only produce pre-scripted responses. To train chatbots that generate abstractive and human-like responses, recent studies have employed not only deep-learning approaches but also different dataset formats. Since the appearance of transformer-based models (Vaswani et al., 2017), such as BERT (Devlin et al., 2018) and GPT (Radford et al., 2018), deep-learning-based language models have been successful in most natural language processing (NLP) fields. However, dialogue generation is considered one of the tasks that have yet to be mastered. Recent deep-learning-based chatbot models are trained on utterance text datasets. However, training the model using only uttered text can make it generate stylistically inconsistent responses and suffer from a lack of long-term context information.
In this paper, we propose a contextually consistent GPT-based chatbot model that uses the PERSONA-CHAT dataset (Zhang et al., 2018), which contains additional personality descriptions as consistency information. An example of a dialogue is shown in Figure 1. For the experiments, we evaluated the model using F1, perplexity (PPL), and Hits@1 as automated metrics and conducted a human evaluation of the quality of the auto-generated responses. The main contributions of our model are as follows:
- We built a GPT-based deep trainable conversational agent that can generate abstractive responses and engage in interactive dialogue.
- We fine-tuned a dialogue-dedicated model that was pre-trained using a large-scale conversational dataset.
- We carefully designed a training mechanism to process personality and utterance input, and compared our model to the state-of-the-art models through automatic and human evaluations. Consequently, our model outperformed the state-of-the-art models in terms of PPL score and human evaluation.

Figure 1: CPC-Agent chit-chat sample. CPC-Agent is a chatbot that generates responses when a user inputs sentences. It can also keep track of the conversation history. As a result, CPC-Agent can generate consistent and personalized sentences.
The rest of this paper is organized as follows. Section 2 reviews the background of natural language generation and work related to the dialogue task. Section 3 describes the details of the proposed model framework. Section 4 discusses the personalized dataset and the model analysis using both automatic metrics and human evaluation. Lastly, Section 5 concludes the paper.
2 BACKGROUND
Deep-learning-based language models in a dialogue task are trained to generate interactive responses. Language models are mainly divided into models before and after the appearance of the transformer. The models before the transformer are mainly based on recurrent neural networks (RNNs), such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014). By employing the large capacity of RNN-based models, (Vinyals and Le, 2015) demonstrated acceptable responses in open-domain settings, and (Serban et al., 2016; Miao et al., 2016; Sordoni et al., 2015) utilized the latent state of RNN models to generate abstractive responses.
More recent language models that adopted the transformer architecture have demonstrated dominant performance in most NLP tasks. (Mazaré et al., 2018) incorporated additional data from ‘Reddit’ chit-chat dialogues to train a transformer-based model. (Wolf et al., 2019a) proposed a fine-tuned transformer-based language model and illustrated extended generative experiments. In the case of open-domain dialogue, however, these models are incapable of generating accurate responses because there is no fixed topic across the conversation.
To further train the conversational model, (Zhang et al., 2018) argued that the dataset itself is the problem that needs to be handled and proposed the novel PERSONA-CHAT dataset. In contrast to a mere utterance dataset, the PERSONA-CHAT dataset contains the additional personality of each speaker, which can make the model generate consistent responses. To exploit the personality information, (Golovanov et al., 2020) fine-tuned a GPT model pre-trained on the manually labeled ‘DailyDialog’ using transfer learning. P²Bot, which was proposed by (Liu et al., 2020), added a reinforcement learning approach: it processed the personality information using a BERT architecture trained on word-piece representations and processed the utterance-based conversational information using a GPT architecture trained on sentence-piece representations. This separate processing pipeline required the model to share the context through reinforcement learning. The encoding methods in the two pipelines also differed, so sharing the context of each propagated input was difficult. In contrast to P²Bot, the TransferTransfo model proposed by (Wolf et al., 2019b) employed a different approach to process the personalities along with the utterances. TransferTransfo obtained higher accuracy by training the GPT model with a distractor built from other candidate utterances, but it simply used the embedded form of the persona together with the agents' utterances as input. However, this form of input did not consider the dialogue history, so it still suffered from limitations in remembering long-term context.
3 MODEL STRUCTURE
We propose a framework to train a generative model for the dialogue task. Our model is designed to process complex data in a sophisticated and efficient manner. To train the model deeply, we adopt a GPT-based model that utilizes a self-attention mechanism. Overall, we train the model through two combined losses, using the distinct contexts gathered in the encoded embedding.
Figure 2: Overall structure of CPC-Agent. The CPC-Agent is trained using the PERSONA-CHAT dataset, which contains additional personality information. The inputs are the concatenated form of the personalities, the history of utterances, and both the true and false candidate utterances for the response to be generated. The input is processed by the GPT-based model with two combined losses: the generation loss and the distraction loss. Training proceeds with a total loss that adds the generation loss and the distraction loss. The blocks in the model are transformer units.
This structure is shown in Figure 2. The CPC-Agent (code available at https://github.com/fightnyy/CPC-Agent) is implemented using three different components, as explained below.
3.1 Training with Personality
We propose to train the model using a complex dataset consisting of personalities and utterances. Personalities can be defined as the persona of each agent, and the utterances represent the dialogue of two different agents A and B. In the case of TransferTransfo, the input can be formulated as a concatenation of vectors that does not consider the history: $[p, u^{A}, u^{B}]$. Unlike TransferTransfo, the CPC-Agent takes the utterance history into account as additional input information and processes the whole input using an integrated single model to facilitate context sharing. To fully understand and process the entire context of the given information, the input format $x_n$ for the $n$-th response generation combines the persona, the dialogue history, the utterance of agent B, and the distraction $d_n$, which we will explain in Section 3.3. This input format is described as follows:

$$x_n = [p,\, h_n,\, u^{B}_{n},\, d_n], \qquad h_n = \bigl[u^{A}_{i},\, u^{B}_{i}\bigr]_{i=1}^{n-1}, \tag{1}$$

where $n$ is the number of pairs of utterances from the conversational agents so far, and $h_n$ denotes the concatenated history of the conversation before $n$. The history provides the whole text of the preceding dialogue sequences. By processing the given input format, the model can take the best advantage of the persona and all of the conversational information.
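To make the input format concrete, the following is a minimal sketch of how $x_n$ from Eq. (1) could be assembled; the separator tokens <SP1> and <SP2> come from Section 3.2, while the function and variable names are illustrative, not the released implementation.

```python
# Minimal sketch of the Eq. (1) input construction; the <SP1>/<SP2>
# separator tokens follow Section 3.2, everything else is illustrative.
from typing import List, Tuple

def build_inputs(persona: List[str], history: List[str],
                 response: str, distractor: str) -> Tuple[str, str]:
    """Build x_n = [p, h_n, u_n^B, d_n] as two candidate sequences."""
    p = " ".join(persona)
    # h_n = [u_i^A, u_i^B]_{i=1}^{n-1}: alternate speaker tokens over history
    h = " ".join(("<SP1> " if i % 2 == 0 else "<SP2> ") + utt
                 for i, utt in enumerate(history))
    true_seq = f"{p} {h} <SP2> {response}"        # real response u_n^B
    distract_seq = f"{p} {h} <SP2> {distractor}"  # distractor d_n (Section 3.3)
    return true_seq, distract_seq

persona = ["i have a dog .", "i like hiking ."]
history = ["hi , how are you ?", "great , i just walked my dog ."]
print(build_inputs(persona, history,
                   "do you have pets too ?", "i hate mondays ."))
```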
3.2 Generative Loss
GPT-based models have shown the best performance in most NLP fields. In particular, they have demonstrated remarkable results in the field of text generation. Accordingly, DialoGPT was introduced by (Zhang et al., 2019) as a pre-trained model for generating interactive responses. For the model structure, we fine-tuned the DialoGPT architecture initialized with its pre-trained weights.
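Concretely, the pre-trained DialoGPT weights can be loaded through the HuggingFace transformers library. The sketch below assumes the small checkpoint (the paper does not state which size was used) and is not the authors' released code.

```python
# Sketch: loading pre-trained DialoGPT weights for fine-tuning.
# The checkpoint size is an assumption; the paper does not state it.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Register the <SP1>/<SP2> separator tokens used in the input format of Eq. (1)
tokenizer.add_special_tokens({"additional_special_tokens": ["<SP1>", "<SP2>"]})
model.resize_token_embeddings(len(tokenizer))
```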
By utilizing the advantages of models that use the self-attention mechanism, we could feed $x_n$ as a combined embedding of $p$, $h_n$, and $u^{B}_{n}$, using the special tokens <SP1> and <SP2> as separators. For training, the CPC-Agent processed the input $x_n$ using self-attention with embedding dimension $dim$. The model obtained an output $o^{n}_{i} \in \mathbb{R}^{dim}$, a vector for word $i$ that carries the context of the persona, the dialogue history, and the utterance of agent B. The next layer was a regression layer for generation that yielded $q_i \in \mathbb{R}^{v}$, where $v$ is the size of the vocabulary. Finally, sigmoid values are taken from the regression distribution $q_i$, and the $k$ highest values are extracted from the probability distribution. Denoting $M$ as the number of generated words that constitute $\mathrm{dialogue}_n$ and $N$ as the number of generated $\mathrm{dialogue}_n$ that constitute $\mathrm{Dialogue}$, the objective function for generating the complete dialogue is as follows:

$$P(\mathrm{Dialogue}) = \prod_{n=1}^{N} P(\mathrm{dialogue}_n \mid x_n), \qquad P(\mathrm{dialogue}_n \mid x_n) = \prod_{i=1}^{M} P(\mathrm{pred}_i \mid x_{n,i}), \tag{2}$$
The reason for keeping the top $k$ values of the probability distribution rather than extracting only the highest one is that the model was designed to look at and learn from the top $k$ different candidates. Finally, we used the cross-entropy loss to train the model, where the generative loss $L_{\mathrm{Gen}}$ is defined as:

$$L_{\mathrm{Gen}} = -\frac{1}{M}\sum_{i=1}^{M}\bigl[\log \mathrm{pred}^{g}_{i} \cdot \mathrm{label}^{g}_{i}\bigr], \tag{3}$$

where $\mathrm{pred}^{g}_{i}$ is the model's next-word prediction and $\mathrm{label}^{g}_{i}$ is the target word to generate at iteration $i$.
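The top-$k$ extraction described above can be sketched as a generic top-$k$ sampling routine; the value of $k$ is an assumption, since the paper does not report it.

```python
# Generic top-k extraction over the output distribution, as described above;
# k = 10 is an assumption, not a value reported in the paper.
import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int = 10) -> int:
    """Keep the k highest-scoring vocabulary entries and sample among them."""
    top_vals, top_idx = torch.topk(logits, k)         # k best candidates
    probs = F.softmax(top_vals, dim=-1)               # renormalize over the k
    choice = torch.multinomial(probs, num_samples=1)  # draw one candidate
    return top_idx[choice].item()

# usage: next_token_id = sample_top_k(lm_logits[0, -1], k=10)
```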
3.3 Distraction Loss
In open-domain dialogue tasks, a known problem is the generation of sentences that have nothing to do with the previous conversation. Inspired by (Zhang et al., 2018), we added distracting factors during training to resolve this inconsistency issue by training the CPC-Agent to distinguish the distraction from the true utterance. For the distraction loss, we used the binary cross-entropy as follows:

$$L_{\mathrm{Distract}} = -\frac{1}{M}\sum_{i=1}^{M}\bigl[\mathrm{label}^{d}_{i}\cdot\log \mathrm{pred}^{d}_{i} + (1-\mathrm{label}^{d}_{i})\cdot\log(1-\mathrm{pred}^{d}_{i})\bigr], \tag{4}$$

where $\mathrm{pred}^{d}_{i}$ is the model's prediction of whether $u^{B}_{n}$ or $d_n$ is the distraction or the real response utterance, and $\mathrm{label}^{d}_{i}$ is the ground truth of that prediction.
For the total loss $L_{\mathrm{total}}$ over the whole learning sequence, the model is trained with the combined generation and distraction losses as follows:

$$L_{\mathrm{total}} = L_{\mathrm{Gen}} + \alpha \cdot L_{\mathrm{Distract}}, \tag{5}$$

where $\alpha$ is a hyperparameter acting as a regulating factor. With these generation and distraction losses, we could train the model on the complex input information.
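A minimal sketch of the combined objective in Eq. (5) is shown below, assuming a TransferTransfo-style setup with GPT2DoubleHeadsModel from HuggingFace transformers, whose second head scores the true response against the distractor. Note that this head uses a cross-entropy over candidates rather than the exact binary form of Eq. (4), so it is an approximation of our loss, not the released implementation.

```python
# Sketch of the total loss in Eq. (5): generation loss plus alpha times the
# distraction loss. GPT2DoubleHeadsModel pairs an LM head (L_Gen) with a
# candidate-classification head standing in for L_Distract; this mirrors a
# TransferTransfo-style setup, not necessarily our exact code.
from transformers import GPT2DoubleHeadsModel

model = GPT2DoubleHeadsModel.from_pretrained("gpt2")  # backbone is an assumption
alpha = 3.0  # best-performing value reported in Section 4.3

def total_loss(input_ids, mc_token_ids, lm_labels, mc_labels):
    # input_ids: (batch, 2, seq_len) -- true response and distractor candidates
    out = model(input_ids,
                mc_token_ids=mc_token_ids,  # position of the scoring token
                labels=lm_labels,           # next-word targets -> L_Gen
                mc_labels=mc_labels)        # index of the true candidate -> L_Distract
    return out.loss + alpha * out.mc_loss   # Eq. (5)
```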
4 EXPERIMENT
Chatbot models for chit-chat tend to get lost and generate irrelevant responses during an open-ended conversation without a particular topic. To keep the conversation consistent, the PERSONA-CHAT dataset was proposed by (Zhang et al., 2018), which makes it possible to personalize the model when generating responses. Therefore, we trained our model using the PERSONA-CHAT dataset. The PERSONA-CHAT dataset was first used for training models in the ConvAI2 competition, which established it as benchmark data, so we could compare our model with the scores reported by other models trained on the persona data. The PERSONA-CHAT dataset consists of 1,155 personalities, each described by at least five sentences. For training, a total of 162,024 utterances in 10,907 dialogues were used. For testing, a total of 15,024 utterances over 968 dialogues were used.
To evaluate the effectiveness of our approach, we conducted automatic and human evaluations. In the automatic evaluation, we tested the models with multiple automatic metrics. In addition, following the human evaluation method of (Liu et al., 2020), we conducted an experiment with 200 voters with at least a university education level to determine how consistently CPC-Agent responds to the persona. We also examined how the automatic evaluation results varied when adjusting the hyperparameter α. Because the test set of the PERSONA-CHAT dataset is unavailable, all tests were conducted on the dev set of PERSONA-CHAT.
4.1 Automatic Evaluations
We used three different automatic metrics for evaluating the models: the F1 score, perplexity (PPL), and Hits@1. The perplexity score is computed from the negative log-likelihood of the correct sequence under the model; it indicates, roughly, the number of words the model is still considering when predicting the next word. The Hits@1 score is a probability-based measure of how often the model ranks the real response highest among candidates. The F1 score is the harmonic mean of word-level precision and recall.
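For reference, perplexity is the exponentiated average negative log-likelihood of the reference sequence of length $M$; this is the standard definition and is stated here only to make the description above precise.

```latex
% Standard definition of perplexity over a reference sequence w_1 ... w_M
\mathrm{PPL} = \exp\!\left(-\frac{1}{M}\sum_{i=1}^{M}\log P(w_i \mid w_{<i})\right)
```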
The comparison of CPC-Agent with the prior models tested on the same data under the same settings is listed in Table 1. Our model was comparable to all the baselines and advanced models. In particular, we achieved a new state-of-the-art PPL of 3.29, a huge improvement and the lowest among the compared models.
Table 1: Automatic evaluation of the models tested using the PERSONA-CHAT dataset; perplexity is denoted as PPL. Seq2Seq + Attention denotes the baseline in the ConvAI2 competition. All models are evaluated in terms of perplexity, F1, and Hits@1 as generated-response-quality measurements. Our model outperforms every other model in terms of perplexity.

Model                  PPL    F1 (%)   Hits@1 (%)
P²Bot                  15.12  19.77    81.9
TransferTransfo        17.51  19.09    82.1
Lost In Conversation   17.3   17.79    17.3
Seq2Seq + Attention    35.07  16.82    12.5
CPC-Agent (ours)       3.29   18.89    79.5
Table 2: Human evaluation of the models tested using the PERSONA-CHAT dataset. Each output of the models is rated in four different degrees, namely basic, good, better, and excellent, through human voting. The first four columns are voting rates (%); the last column is the average score computed from the ratings of each degree. Our model scored best in terms of the excellent grade and the average rating.

Model                  Basic: 1   Good: 2   Better: 3   Excellent: 4   Average
P²Bot                  18.9       26.3      28.6        26.2           2.621
TransferTransfo        41.7       25.3      28.7        4.3            1.956
Lost In Conversation   26.3       48.7      22.0        3.0            2.017
CPC-Agent (ours)       19.1       26.5      23.5        30.9           2.662
It was more than 10 times smaller than the baseline PPL score. The reason PPL in particular appeared so good is, first, that a large-scale pre-trained representative model was used and, second, that the model itself became more robust with the insertion of a distractor during training. This result verified that our model exhibits better confidence in deciding the next words. The F1 score of our model was 18.89, which was competitive with the compared models and better than the Lost In Conversation and baseline models. In terms of Hits@1, our model scored 79.5, which largely differed from Lost In Conversation and the baseline models. It did not outperform the 81.9 of P²Bot or the 82.1 of TransferTransfo, but the differences were within an acceptable range.
4.2 Human Evaluation
In the dialogue task in particular, even if a model obtains a good score on the automatic metrics, it does not mean the model actually performs much better. Automatically evaluating paraphrases or fluency is very difficult. For example, “I love my dog” and “puppy is liked by me” mean almost the same thing to a human, yet neither F1 nor Hits@1 considers them to have the same meaning. As an example of this, Table 1 shows that TransferTransfo and Lost In Conversation occupied the first and second places, respectively, in the NeurIPS 2018 dialogue competition ConvAI2: TransferTransfo was rated a better model than Lost In Conversation in all scores, but the human evaluation of these models showed the opposite result. Hence, we conducted an additional human evaluation of our model.
Table 2 lists the human evaluation of the compared models using the PERSONA-CHAT dataset. We asked people with at least a university education level to evaluate our model. In the survey, each option was labeled as follows:
Basic(1): The response is good only in terms of grammar and sentence structure.
Good(2): B's response is not only consistent but also coherent with the context.
Better(3): B's response is coherent as well as interesting and informative, instead of just a simple response like “Yes.”
Excellent(4): The response of B is consistent with the persona.
Average: The average of each score weighted by its rating portion.
TransferTransfo received the most votes, 41.7%, for “basic(1)”, which indicates that its generated responses may be grammatical but inconsistent. In terms of “good(2)”, Lost In Conversation received 48.7% of the votes, the highest among the compared models. P²Bot did not obtain the highest share for any of the options. In terms of the “excellent(4)” option, our model obtained 30.9% of the votes, which implies that the responses generated by our model were consistent with the personality of the chatbot. The “excellent(4)” grade is the closest to the criterion of the PERSONA-CHAT dataset, so our model demonstrated performance beyond P²Bot with a significant difference. The “average” score also shows that the performance of our model exceeded that of the other compared models, meaning that humans felt most natural when they interacted with CPC-Agent.
4.3 Distraction Factor Adjustment
We recorded the automatic metric scores as the α value changed, to evaluate how the model performance varies with the weight of L_Distract. The value of α was set to 0.0 (distractor never used), 0.5, 1.0, 3.0, and 5.0.
Figure 3: Hits@1 score curve for each α value of our model. The x-axis represents the α value, and the y-axis shows the Hits@1 score. The score reaches its maximum when the α value is 3.0.
Figure 3 shows that the larger the α value, the better the Hits@1 result, until it reaches the maximum. When the α values were exactly 0.0, 0.5, 1.0, 3.0, and 5.0, the Hits@1 values were 60.32, 65.85, 75.37, 79.51, and 76.23, respectively. In particular, even when our model did not use a distractor, its performance exceeded the existing attention baseline and Lost In Conversation by a large margin. The best performance was achieved when the α value was set to 3.0; however, performance decreased when α exceeded 3.0.
Figure 4 shows that PPL was most affected by the α value, whereas for the F1 score the difference was insignificant. When the α values were 0.0, 0.5, 1.0, 3.0, and 5.0, the PPL values were 20.41, 18.23, 5.65, 3.29, and 4.23, respectively, and the F1 values were 16.75, 16.90, 17.06, 18.89, and 17.23, respectively. Even when the distractor was not used, the model easily surpassed the existing baseline. In particular, the largest PPL improvement appeared when α increased to 1.0, and PPL reached its best value when α was 3.0; when α exceeded 3.0, the performance decreased. For F1, the influence of the distractor appeared smaller than for the other two metrics, but the best performance likewise occurred at α = 3.0.

Figure 4: F1 and perplexity score curves for each α value of our model. The x-axis represents the α value, and the y-axis shows the perplexity and F1 scores. Both scores are best when the α value is 3.0.
5 CONCLUSIONS
We took advantage of a pre-trained representative model, which is known to perform well in NLP tasks, and proposed a new GPT-based chatbot model that can serve as a consistent personalized conversational agent. By using a distractor mechanism in the model, PPL in the automatic evaluation demonstrated a large performance gap over the existing state-of-the-art models. Our model also demonstrated the best performance in the human evaluation, and its F1 and Hits@1 scores were comparable to the state-of-the-art models. Overall, our model can be considered a state-of-the-art conversational agent trained with personality information. However, we have not yet reduced the model size. During the experiments, we found that the number of parameters of transformer-based models (e.g., GPT, BERT) is very large, and in a typical environment it would be difficult even to load the GPT model itself, let alone train it. In future studies, to alleviate the model-size problem, we will use a lightweight model such as DistilGPT2, DistilBERT, or an LSTM to create a model with far fewer parameters than the current one.
ACKNOWLEDGEMENTS
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01431) and also under the National Program for Excellence in SW (2015-0-00908), both supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
REFERENCES
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Golovanov, S., Tselousov, A., Kurbanov, R., and Nikolenko, S. I. (2020). Lost in conversation: A conversational agent based on the transformer and transfer learning. In The NeurIPS'18 Competition, pages 295–315. Springer.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Liu, Q., Chen, Y., Chen, B., Lou, J.-G., Chen, Z., Zhou, B., and Zhang, D. (2020). You impress me: Dialogue generation via mutual persona perception. arXiv preprint arXiv:2004.05388.
Mazaré, P.-E., Humeau, S., Raison, M., and Bordes, A. (2018). Training millions of personalized dialogue agents. arXiv preprint arXiv:1809.01984.
Miao, Y., Yu, L., and Blunsom, P. (2016). Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding with unsupervised learning. Technical report, OpenAI.
Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., and Bengio, Y. (2016). A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., and Nie, J.-Y. (2015). A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 553–562.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Vinyals, O. and Le, Q. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2019a). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Wolf, T., Sanh, V., Chaumond, J., and Delangue, C. (2019b). TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., and Weston, J. (2018). Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.
Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C., Gao, X., Gao, J., Liu, J., and Dolan, B. (2019). DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.