
 
For instance, a remote control has been used to select which videos/cameras to display, or pre-defined orchestration templates have been used to show the meeting participants. Such existing systems cannot manage a large number of video streams with a high level of detail, dynamic rendering, adaptability to the user intent, or programmability and flexibility in the orchestration.
Video orchestration based on “audio events” is one step in this direction. Yet, since around 70% of meaning is derived from nonverbal behavior/communication (Engleberg, 2006), information that is useful for video orchestration is missing (e.g. gestures, expressions, attention).
Al-Hames et al. (Al-Hames and Dielmann, 2006) showed that audio information alone is not sufficient and that visual features are essential. Al-Hames et al. (Al-Hames and Hörnler, 2006) then proposed an approach based on rules applied to low-level features such as global motion, skin blobs and acoustic features. Hidden Markov Models (HMMs) have also been used for video orchestration (Hörnler, 2009) by combining low-level and high-level features.
Based on these observations and inspired by (Al-Hames and Hörnler, 2007) and (Ding, 2006), our video orchestration relies on a system based on HMMs that takes as input only high-level features such as gesture (Fourati and Marilly, 2012), motion (Cheung and Kamath, 2004), facial expression (Hromada et al., 2010) and audio (O’Gorman, 2010). The benefit of using high-level features is that it addresses the programmability of the video orchestration during video conferences: basic users can define their own rules transparently, and this approach improves the user experience, the immersion and the efficiency of video conferences.
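How the different high-level feature streams are combined into HMM observations is implementation dependent; one simple possibility, given here purely for illustration (the feature labels and the mixed-radix encoding are hypothetical, not the exact representation used in our system), is to quantize each analyzer output and pack the labels into a single discrete symbol per time step:

# Illustrative only: a possible encoding of high-level analyzer outputs
# into one discrete observation symbol per time step (feature names and
# vocabulary are hypothetical).
def encode_observation(gesture, motion, expression, speaking):
    """Map high-level features to a discrete HMM observation symbol."""
    # Each feature is first reduced to a small set of labels.
    gesture_id = {"none": 0, "hand_raised": 1, "pointing": 2}[gesture]
    motion_id = {"low": 0, "high": 1}[motion]
    expression_id = {"neutral": 0, "smiling": 1}[expression]
    speaking_id = 1 if speaking else 0
    # Combine the labels into a single symbol index (mixed-radix encoding).
    return ((gesture_id * 2 + motion_id) * 2 + expression_id) * 2 + speaking_id

# Example: a pointing, low-motion, neutral, speaking participant.
symbol = encode_observation("pointing", "low", "neutral", True)  # -> 17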
3 PROGRAMMABILITY 
Implicit, user intent-based programmability capabilities that model the video orchestration and smartly orchestrate the display of video/multimedia streams have been implemented in our system. The data used by our HMM engine to model the video orchestration are captured through the combination of two approaches: visual programming and programming by example. In our HMM model, the transition matrix A contains the transition probabilities between the different camera views; the emission matrix B contains the emission probabilities of each observation given the current state (i.e. the screen being shown); and the initialization vector π contains, for each camera, the probability of being shown first.
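As a minimal illustration, assuming three camera views and four discrete observation symbols, such a model (A, B, π) can be written as follows; all probability values are made up, not learned parameters of our system.

import numpy as np

# Hypothetical HMM for 3 camera views (states) and 4 observation symbols.
states = ["camera_speaker", "camera_audience", "camera_overview"]
observations = ["speech", "gesture", "global_motion", "silence"]

# A[i, j]: probability of switching from camera i to camera j.
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.6, 0.1],
              [0.4, 0.3, 0.3]])

# B[i, k]: probability of observing symbol k while camera i is shown.
B = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.1, 0.5, 0.3, 0.1],
              [0.2, 0.2, 0.3, 0.3]])

# pi[i]: probability that camera i is shown first.
pi = np.array([0.6, 0.2, 0.2])

assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)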
3.1 Solution Description 
The “multimedia orchestrator” module, part of the videoconferencing system, has therefore been augmented with the following three functionalities:
o  Smart video orchestration capabilities based on HMMs.
o  Learning/programmability capabilities, i.e. the system is able to automatically define new orchestration models through user intent capture and interactions (see the sketch below).
o  Smart template detection, i.e. the system is able to recognize the video orchestration model that best fits the video conference context/scenario and the user profile.
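As an illustration of the learning/programmability capability, the following sketch (with hypothetical demonstration data) estimates A, B and π by simply counting camera transitions and event emissions in sequences where the user has demonstrated which camera to show for each observed event.

import numpy as np

# Illustrative programming-by-example: estimate A, B and pi by counting
# over user-demonstrated (camera, observation) sequences; 3 cameras and
# 4 observation symbols as in the earlier sketch. Data is hypothetical.
demos = [
    [(0, 0), (0, 0), (1, 1), (2, 3)],   # one demonstrated meeting excerpt
    [(0, 0), (2, 2), (2, 3), (0, 0)],
]
n_states, n_obs = 3, 4
A = np.ones((n_states, n_states))       # add-one smoothing avoids zero rows
B = np.ones((n_states, n_obs))
pi = np.ones(n_states)

for seq in demos:
    pi[seq[0][0]] += 1                  # which camera was shown first
    for (s, o) in seq:
        B[s, o] += 1                    # event o observed while camera s shown
    for (s, _), (s_next, _) in zip(seq, seq[1:]):
        A[s, s_next] += 1               # camera switch s -> s_next

# Normalize counts into probability distributions.
A /= A.sum(axis=1, keepdims=True)
B /= B.sum(axis=1, keepdims=True)
pi /= pi.sum()

When the camera choices are not explicitly demonstrated, an unsupervised re-estimation such as Baum-Welch could refine these counts instead.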
Figure 2 presents a basic scheme of the solution. The engine of the “Multimedia Orchestrator” module is based on specific mechanisms (e.g. learning mechanisms, scenario recognition) integrating HMMs.
 
Figure 2: Basic scheme of the solution. 
The “MM orchestrator” module takes as inputs the video streams and the video/audio event metadata (coming, for instance, from the outputs of video/audio analyzers). Video analyzers detect high-level video events such as gestures, postures and faces, while audio analyzers detect audio events such as who is speaking, keywords, silence and the noise level. Initially, based on the first received video and audio event metadata, such as “speaker metadata”, the classifier module selects the template that best fits the temporal sequence of events. By default, the user can select a model related to the current meeting scenario. During use, the classifier can change the model if another one better fits the temporal sequence of events.
This problem of selecting the right model is known as the recognition problem. Both the Forward algorithm (Huang et al., 1990) and the Backward algorithm can solve it; in our MM orchestrator we have used the Forward algorithm, as sketched below.
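For illustration, the following sketch (with hypothetical models and event symbols) scores each candidate template with the Forward algorithm and keeps the one under which the observed event sequence is most likely.

import numpy as np

def forward_log_likelihood(A, B, pi, obs):
    """Forward algorithm: log P(obs | model), with scaling for stability."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik

# Two hypothetical orchestration templates over 2 cameras / 2 symbols
# (made-up numbers), e.g. a lecture vs. a panel discussion.
models = {
    "lecture": (np.array([[0.9, 0.1], [0.5, 0.5]]),
                np.array([[0.8, 0.2], [0.3, 0.7]]),
                np.array([0.9, 0.1])),
    "panel_discussion": (np.array([[0.5, 0.5], [0.5, 0.5]]),
                         np.array([[0.5, 0.5], [0.5, 0.5]]),
                         np.array([0.5, 0.5])),
}
obs_sequence = [0, 0, 1, 0, 0]          # encoded event symbols (illustrative)

# Keep the template under which the observed event sequence is most likely.
best = max(models, key=lambda m: forward_log_likelihood(*models[m], obs_sequence))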
The next step after the selection of the best template is to select the most