3.4  Results and Analysis 
Table 2 reports the results of each model on intent, act
and slot recognition. The last column, FrameAcc, gives
the proportion of test utterances for which the intents,
acts and slots are all recognized correctly. The second
row lists the test set used; “Overall” denotes the
combined test set formed from Sim-R and Sim-M.
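To make the FrameAcc column concrete, the following is a minimal sketch of how such a score could be computed. It assumes the usual reading of frame accuracy, in which an utterance counts as correct only when its intent, its set of acts and its full slot tag sequence all match the gold annotation. The `Frame` layout, the function name `frame_accuracy` and the toy utterances are illustrative assumptions and are not taken from the original experiments.

```python
from typing import Dict, Sequence

# Hypothetical frame layout: {"intent": str, "acts": set of str, "slots": list of str}
Frame = Dict[str, object]


def frame_accuracy(gold: Sequence[Frame], pred: Sequence[Frame]) -> float:
    """Fraction of utterances whose intent, acts and slot tags all match the gold frame."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        1
        for g, p in zip(gold, pred)
        if g["intent"] == p["intent"]
        and set(g["acts"]) == set(p["acts"])
        and list(g["slots"]) == list(p["slots"])
    )
    return correct / len(gold)


# Toy, made-up utterances (not drawn from Sim-R or Sim-M).
sim_r_gold = [
    {"intent": "reserve_restaurant", "acts": {"inform"}, "slots": ["O", "B-time"]},
    {"intent": "reserve_restaurant", "acts": {"affirm"}, "slots": ["O"]},
]
sim_r_pred = [
    {"intent": "reserve_restaurant", "acts": {"inform"}, "slots": ["O", "B-time"]},
    {"intent": "reserve_restaurant", "acts": {"negate"}, "slots": ["O"]},  # wrong act
]
sim_m_gold = [
    {"intent": "buy_movie_tickets", "acts": {"inform"}, "slots": ["B-movie", "O"]},
]
sim_m_pred = [
    {"intent": "buy_movie_tickets", "acts": {"inform"}, "slots": ["B-movie", "O"]},
]

acc_r = frame_accuracy(sim_r_gold, sim_r_pred)
acc_m = frame_accuracy(sim_m_gold, sim_m_pred)
# "Overall" is scored on the concatenation of both test sets.
acc_overall = frame_accuracy(sim_r_gold + sim_m_gold, sim_r_pred + sim_m_pred)
print(acc_r, acc_m, acc_overall)
```

Because the combined set is a concatenation, its score is a size-weighted average of the per-dataset scores, so the larger test set has more influence on the combined figure.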
For MemNet and SDEN, the models with randomly
initialized word embeddings perform better on the larger
Sim-R dataset, whereas on the smaller Sim-M dataset the
models with pre-trained word embeddings perform better.
For intent recognition, NoContext is markedly worse than
all other models, which suggests that intent recognition
depends heavily on context. With contextual information
available, all other models reach high intent accuracy,
and MSDU achieves the best result among them.
For act recognition, NoContext again performs below the
other models, which indicates that contextual information
also helps this task. MSDU clearly outperforms the other
models on act recognition, suggesting that it is better
at capturing the relationship between the context and the
current user utterance.
For slot tagging, there is no notable difference in
performance among MemNet, SDEN and NoContext. MSDU and
its variants, on the other hand, achieve better results.
We also find that MSDU-Concat performs nearly the same as
MSDU on slot recognition, meaning that the concatenation
step contributes little to slot recognition.
From the test results, we find that the MSDU model
achieves a FrameAcc about 5% higher than that of the
MemNet and SDEN models.
Interestingly, SDEN does not outperform MemNet even
though it uses a more complex context encoding method.
MSDU-BERT-Concat and these two models all use randomly
initialized word embeddings, so the difference lies
mainly in the context encoding: MSDU-BERT-Concat encodes
the context with a hierarchical GRU, which is even
simpler than the context encoding used by MemNet, yet it
achieves a FrameAcc about 2% higher than MemNet and SDEN.
This casts doubt on the necessity of the attention
mechanism in context encoding.
From the results of the MSDU variants, we can also
conclude that the concatenation procedure brings an
improvement of about 1.1%, the BERT module about 2.7%,
and the combination of the two 3.4%.
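As a small worked check of these three figures, the sketch below compares the sum of the two individual gains with the combined gain. The variable names are ours, and the values are simply the percentages quoted above; the text does not name the metric or break the figures down by dataset.

```python
# Improvements quoted in the text, in percent (variable names are illustrative).
gain_concat = 1.1  # concatenation procedure alone
gain_bert = 2.7    # BERT module alone
gain_both = 3.4    # concatenation and BERT together

sum_of_parts = gain_concat + gain_bert
overlap = sum_of_parts - gain_both

print(f"sum of individual gains: {sum_of_parts:.1f}")
print(f"combined gain:           {gain_both:.1f}")
print(f"difference:              {overlap:.1f}")
```

The combined gain is somewhat smaller than the sum of the individual gains, which would be consistent with the two components providing partly overlapping benefits rather than strictly additive ones.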
4  CONCLUSIONS AND FUTURE WORK
We proposed the MSDU model, which recognizes intents,
acts and slots using the dialogue history in multi-turn
spoken dialogue, and evaluated it on different datasets
together with several variants. The test results show
that the design of MSDU is effective and brings a
substantial improvement.
In future work, we will study how to apply this new model
architecture to higher-level dialogue understanding
tasks, such as ontology-based slot recognition and the
alignment of intents, acts and slots. So far, we have not
addressed the subordinate relationships among intents,
acts and slots, which are essential to dialogue
understanding.
REFERENCES 
Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür and Larry
Heck. 2017. Sequential Dialogue Context Modeling for
Spoken Language Understanding. arXiv preprint
arXiv:1705.03455.
B. Liu and I. Lane. 2016. Attention-based recurrent
neural network models for joint intent detection and
slot filling. arXiv preprint arXiv:1609.01454.
A. Bordes, Y. L. Boureau, and J. Weston. 2017. Learning
end-to-end goal-oriented dialog. In Proceedings of the
2017 International Conference on Learning
Representations (ICLR).
Yun-Nung Chen, Dilek Hakkani-Tür, Gokhan Tür et al.
2016. End-to-End Memory Networks with Knowledge
Carryover for Multi-Turn Spoken Language Understanding.
In Proceedings of the 2016 Meeting of the International
Speech Communication Association.
Dilek Hakkani-Tür, Gokhan Tür, Asli Celikyilmaz,
Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang.
2016. Multi-Domain Joint Semantic Frame Parsing using
Bi-directional RNN-LSTM. In Proceedings of the 2016
Annual Conference of the International Speech
Communication Association.
E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov,
et al. 2018. Learning Word Vectors for 157 Languages. In
Proceedings of the 2018 International Conference on
Language Resources and Evaluation (LREC).