the LBS service. At the same time, combining haptic feedback such as the Apple Taptic Engine with voice reminders for multimodal output can enhance interaction confidence: when the user drags and drops content onto the “any door”, the vibration feedback confirms that the intent has been captured successfully.
The spatio-temporal alignment problem can be solved by a spatio-temporal alignment attention mechanism: a cross-modal temporal encoder is constructed to compute the time offsets between speech and touch events and to dynamically adjust the intent fusion weights (Tsai, 2019).
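As a rough illustration of this idea, the sketch below implements a single cross-modal attention step in which touch-event features attend over a speech sequence and a learned gate produces the fusion weight; the module name, tensor shapes, and gating scheme are illustrative assumptions rather than the exact encoder of Tsai (2019).

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-modal attention with a learned fusion weight."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)  # produces the fusion weight

    def forward(self, touch: torch.Tensor, speech: torch.Tensor):
        # touch: (batch, T_touch, d), speech: (batch, T_speech, d).
        # Queries come from the touch stream; keys/values from speech, so
        # each touch event gathers the speech context best aligned with it.
        attended, _ = self.attn(query=touch, key=speech, value=speech)
        # Fusion weight in (0, 1): how strongly the speech evidence should
        # contribute to the intent representation of each touch event.
        w = torch.sigmoid(self.gate(torch.cat([touch, attended], dim=-1)))
        return w * attended + (1 - w) * touch

fusion = CrossModalFusion()
touch = torch.randn(2, 5, 64)    # 5 touch events per sample
speech = torch.randn(2, 40, 64)  # 40 speech frames per sample
fused = fusion(touch, speech)    # (2, 5, 64)
```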
Alternatively, Dynamic Time Warping (DTW) can be used when the only problem to be solved is multimodal timing asynchrony, i.e., the timing signals of different modalities drifting out of sync on the time axis because of differing acquisition frequencies or physical delays. DTW essentially finds the optimal nonlinear alignment path between two time series by minimizing the cumulative distance. Given two sequences X=[x1,...,xN] and Y=[y1,...,yM], a distance d(i,j) is computed for each pair of points (xi,yj), and the cumulative distance matrix D is computed by the recursion D(i,j)=d(i,j)+min(D(i-1,j),D(i,j-1),D(i-1,j-1)); backtracking from D(N,M) then yields the least costly alignment path (Sakoe, 1971).
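A minimal sketch of this recursion and the backtracking step, assuming one-dimensional sequences and the absolute difference as the point distance d(i,j):

```python
import numpy as np

def dtw(x: np.ndarray, y: np.ndarray):
    """Classic DTW: cumulative cost matrix plus a backtracked alignment path."""
    N, M = len(x), len(y)
    D = np.full((N + 1, M + 1), np.inf)  # borders padded with inf
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = abs(x[i - 1] - y[j - 1])  # point distance d(i, j)
            # Recursion: D(i,j) = d(i,j) + min of the three predecessors.
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from D(N, M) to recover the least costly alignment path.
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[N, M], path[::-1]

cost, path = dtw(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.5, 2.0, 3.0]))
```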
For the semantic complementarity problem, early mature techniques include Co-Training and Tri-Training: in Co-Training, two modal classifiers provide pseudo-labels to each other, while Tri-Training introduces a third modality as an arbiter. Multimodal Co-Training, however, is essentially an improvement of the collaboration among multiple unimodal models in accomplishing a task; it always remains at a shallow level of semantic understanding, and reasoning over subtle associations that are difficult to capture, such as “rapid speech + frowning = anger”, is not possible. This greatly hinders accurate prediction of user needs at the level of intent understanding.
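For concreteness, a toy version of the Co-Training loop described above, with scikit-learn classifiers standing in for the two modal models (the confidence threshold, round count, and all names are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xa, Xb, y, Xa_u, Xb_u, rounds=5, thresh=0.9):
    """Two per-modality classifiers exchange pseudo-labels on an unlabeled pool."""
    clf_a, clf_b = LogisticRegression(), LogisticRegression()
    for _ in range(rounds):
        clf_a.fit(Xa, y)  # classifier for modality A
        clf_b.fit(Xb, y)  # classifier for modality B
        if len(Xa_u) == 0:
            break
        # Confident predictions from either modality become pseudo-labels
        # that expand the shared training set for both classifiers.
        pa = clf_a.predict_proba(Xa_u).max(axis=1)
        pb = clf_b.predict_proba(Xb_u).max(axis=1)
        keep = (pa > thresh) | (pb > thresh)
        if not keep.any():
            break
        pseudo = np.where(pa >= pb, clf_a.predict(Xa_u), clf_b.predict(Xb_u))
        Xa, Xb = np.vstack([Xa, Xa_u[keep]]), np.vstack([Xb, Xb_u[keep]])
        y = np.concatenate([y, pseudo[keep]])
        Xa_u, Xb_u = Xa_u[~keep], Xb_u[~keep]
    return clf_a, clf_b
```

Note that even here the two classifiers only exchange labels; neither model ever sees the other modality's features, which is exactly the shallow coupling criticized above.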
Cross-modal learning, on the other hand, maps data from different modalities into the same embedding space for mutual transformation and mapping, so that information extracted from one modality can be used to understand or enhance the content of another. It truly moves the relationship between modalities from “collaboration” to “integration” and realizes a deep understanding of user semantics. CLIP is the classic cross-modal learning approach: it aligns image and text representations by contrastive learning, using a dual-stream architecture, i.e., a separate encoder for each modality. Given a batch of image-text pairs {(I1,T1),(I2,T2),...,(IN,TN)}, where N is the batch size, the image encoder extracts image features to output image feature vectors and the text encoder extracts text features to output text feature vectors; a contrastive loss then maximizes the similarity of the matched image-text pairs and minimizes the similarity of the mismatched pairs (Radford, 2021).
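A condensed sketch of this objective, assuming pre-computed, L2-normalized feature vectors from the two encoders and following the symmetric cross-entropy formulation of the CLIP paper (the encoders themselves and the learnable temperature are omitted):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feats: torch.Tensor,
                          txt_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N image-text pairs."""
    # (N, N) cosine-similarity logits; diagonal entries are matched pairs.
    logits = img_feats @ txt_feats.t() / temperature
    targets = torch.arange(logits.size(0))
    # Cross-entropy in both directions pulls the N matched pairs together
    # and pushes the N^2 - N mismatched pairs apart.
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2

img = F.normalize(torch.randn(8, 512), dim=-1)  # image feature vectors
txt = F.normalize(torch.randn(8, 512), dim=-1)  # text feature vectors
loss = clip_contrastive_loss(img, txt)
```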
The ViLBERT model, which utilizes a cross-modal attention mechanism, has likewise been reported to improve intent recognition accuracy by 15% (Lu, 2019). If a modal conflict is detected, counterfactual reasoning can be initiated in conjunction with causal intervention to pursue the user's true intention (Pearl, 2009).
Although cross-modal learning technology is not yet fully mature, the development of the intentional interaction paradigm makes a deeper machine understanding of the user's intent an inevitable requirement. Cross-modal learning will therefore become an important breakthrough direction, and it is a technology whose early and widespread commercial deployment the public can expect.
2.3 Privacy Protection for the Whole
Process of Intent Recognition
In a hypothetical scenario, when a user clips multiple “diabetes diet” articles in a row, the system triggers a customized recipe recommendation from a health management app (Devlin et al., 2018). This process relies on Federated Learning for privacy protection: the raw data is kept locally and processed by the end-side device, and only model gradient updates are uploaded (McMahan et al., 2017) to the central server.
Taking the FedAvg algorithm as an example of a typical federated learning process: (1) the central server releases the global model θ_t for a selected subset of clients to download; (2) each client trains on its local data to obtain a local model θ_{t+1}^k; (3) the clients upload their parameters to the server, which updates the global model by a weighted average, for example weighted by the amount of local data:

θ_{t+1} = Σ_{k=1}^{K} (n_k / n) θ_{t+1}^k    (1)

where n_k is the number of samples held by client k and n = Σ_k n_k. Steps 2-3 are repeated until the model converges (McMahan, 2016).
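A minimal sketch of this server-side aggregation step, assuming each client reports its updated parameter vector together with its local sample count (function and variable names are illustrative):

```python
import numpy as np

def fedavg_aggregate(client_params: list, client_sizes: list) -> np.ndarray:
    """Weighted average of client models per Eq. (1): weights are n_k / n."""
    n = sum(client_sizes)
    global_params = np.zeros_like(client_params[0])
    for theta_k, n_k in zip(client_params, client_sizes):
        global_params += (n_k / n) * theta_k
    return global_params

# One communication round with three simulated clients.
clients = [np.random.randn(10) for _ in range(3)]  # local models θ_{t+1}^k
sizes = [1200, 300, 500]                           # local dataset sizes n_k
theta_next = fedavg_aggregate(clients, sizes)      # new global model θ_{t+1}
```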
Of course, FL is continuously being improved and refined with additional techniques such as knowledge distillation and model compression, giving rise to variants such as federated transfer learning, in which a model learned in the source domain is transferred to the target domain, and personalized federated learning. Among them, personalized