Real Time System for Gesture Tracking in Psycho-motorial
Rehabilitation
Massimo Magrini and Gabriele Pieri
Institute of Information Science and Technologies, National Research Council, Via Moruzzi 1, Pisa, Italy
Keywords: Gesture Recognition, Tracking System, Autism Spectrum Disorder, Rehabilitation Systems.
Abstract: In the context of the research activities of the Signal and Images Lab of ISTI-CNR, a system is under development for real-time gesture tracking, to be used in active well-being self-assessment activities, and in particular applied to medical coaching and music-therapy. The system uses a video camera, a FireWire digitization board, and a computer running custom-developed software. During the test sessions a person freely moves his body inside a specifically designed room. The developed algorithms extract features from the human figure, such as spatial position, arm and leg angles, etc. Through the developed system the operator can link these features to sounds synthesized in real time, following a predefined schema. The system latency is very low thanks to the use of Mac OS X native libraries (Core Image, Core Audio). The resulting augmented interaction with the environment could help to improve contact with reality in young subjects affected by autism spectrum disorders (ASD).
1 INTRODUCTION
In recent years, sensor-based interactive systems for helping the treatment of learning difficulties and disabilities in children have appeared in the specialized literature (Ould Mohamed and Courbulay, 2006; Kozima et al., 2005). These systems, like the quite popular SoundBeam (Swingler and Price, 2006), generally consist of sensors connected to a computer, programmed with special software which reacts to the sensors' data with multimedia stimuli.
The general philosophy of these systems is based on the idea that even profoundly physically or learning-impaired individuals can become expressive and communicative using music and sound (Villafuerte et al., 2012). The sense of control which these systems provide can be a powerful motivator for subjects with limited interaction with reality. Our research department has a long tradition in developing special gesture interfaces for controlling multimedia generation, although targeted at new media art.
While systems like SoundBeam rely entirely on ultrasonic sensors, our system is based mostly on real-time video processing techniques; moreover, an additional set of sensors (e.g. infrared or ultrasonic) can easily be used. The use of video processing techniques adds more parameters, which can be used for the exact localization of, and details about, the human gestures to be detected and recognized. Using the implemented software interface, the operator can link these extracted video features to sounds synthesized in real time, following a predefined schema.
The proposed system has been tested, as a case study, in a test campaign on a set of real patients affected by autism spectrum disorder (ASD), in order to provide them with increased interaction with the external environment and to try to reduce their pathological isolation (Riva et al., 2013).
The case study testing on young subjects yielded very positive results, confirmed both by the professional therapists and by the parents of the patients. In particular, the therapists reported a positive outcome from the assisted coaching therapies. Moreover, this positive evolution could bring an improvement in terms of transferring the motivation and curiosity for full communicative interaction to the external environment, thus improving the well-being of the subjects.
2 SYSTEM METHODOLOGY
The system is installed in a special empty room, with most of the surfaces (walls, floor) covered with wood. The goal is to build a warm space which, in some way, recalls the prenatal environment. All system parts such as cables, plugs, etc. are carefully hidden, as they are potential elements of distraction for autistic subjects. The ambient light is gentle and indirect, also to avoid shadows that can affect the precision of the motion detection.
The whole system is based on an Apple Macintosh computer (Figure 1), running the latest version of Mac OS X. The video camera is connected to the computer through a FireWire digitizer, the Imaging Source DFG1394. This is a very fast digitizer, which allows a latency of only one frame in the video processing path. As the output audio card we decided to use the Macintosh internal one: its quality is superior to that of an average PC and more than sufficient for our purposes. A pair of TASCAM amplified loudspeakers completes the basic system. To use additional sensors (infrared, ultrasonic) we could add a simple USB board which digitizes analogue control signals, translating them into standard MIDI messages that are easy to manage inside the application.
We chose the Mac OS platform for its reliability in real-time multimedia applications, thanks to its very robust frameworks: the Core Audio and Core Image libraries permit very fast processing without glitches and underruns.
Figure 1: Structure of the System.
3 THE INTERACTIVE SYSTEM
The software is a standalone application, structured in different modules following a strict C++ paradigm (Figure 2). The most important modules are the Sequence Grabber, which manages the stream of video frames coming from the video digitizer; the Gesture Tracking module, which analyses the frames and extracts the gesture parameters; and the Mapper, responsible for the mapping between detected gesture parameters and the generated sounds.
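As an illustration only, the module boundaries just described might be expressed with C++ interfaces along these lines (a hypothetical sketch; all names are ours, not taken from the actual implementation):

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of the three module boundaries described above.
    // All identifiers are illustrative, not the system's actual code.
    struct Frame {                       // one grabbed video frame
        int width = 320, height = 240;
        std::vector<uint8_t> pixels;     // grayscale data
    };

    struct GestureParams {               // the features listed in Section 3.3
        float rightArmAngle, leftArmAngle;
        float rightLegAngle, leftLegAngle;
        float torsoAngle;
        float barycentreX, barycentreY;
        float distanceIndex;             // frame height / figure height
        float globalActivity, crestFactor;
    };

    class SequenceGrabber {              // wraps the FireWire digitizer
    public:
        virtual bool nextFrame(Frame& out) = 0;
        virtual ~SequenceGrabber() = default;
    };

    class GestureTracker {               // image processing + feature extraction
    public:
        virtual GestureParams track(const Frame& f) = 0;
        virtual ~GestureTracker() = default;
    };

    class Mapper {                       // features -> MIDI messages
    public:
        virtual void apply(const GestureParams& p) = 0;
        virtual ~Mapper() = default;
    };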
The biggest problem in the gesture control of sounds is latency, i.e. the delay between a gesture and the corresponding effect on the generated sound. Commercial systems like the popular Microsoft Kinect could greatly simplify the system, but they introduce a latency (around 100 ms) that is not acceptable for our purposes. Our approach guarantees the minimum latency for the adopted frame rate, which is 40 ms at 25 FPS or 33.3 ms at 30 FPS.
Figure 2: Software Architecture Structure.
3.1 Graphical User Interface
Following the music-therapist's specifications, we implemented the application's graphical user interface as a single window, with subfolders for specific topics (Figure 3). In this way every aspect of the system setup is quickly accessible to the operators during the music-therapy sessions.
Figure 3: The appearance of the Graphical Interface.
HEALTHINF2014-InternationalConferenceonHealthInformatics
564
The upper area of the GUI contains the video preview and the detected-parameters monitor (see Section 3.3 for details on the parameters), while the lower one permits the operator to set up the mapping between parameters and generated sounds.
3.2 Image Processing
The incoming frame grabbed from the digitizer is processed in several steps, in one of two alternative modalities: area based or edge based (see Figure 4, top). In the first modality the segmentation process operates on full areas, while the edge-based one relies on the edges present in the grabbed frame.
In both modes the image is first smoothed with a Gaussian filter (computed quickly thanks to the Core Image library). In edge mode the image is also processed with an edge detection filter.
Then we use a background subtraction technique to isolate the human figure from the environment. By pressing the "Store background" button (obviously with no human subjects in front of the camera) we can store the background, area or edge based. When a figure is present in front of the camera, the incoming frames are compared with the stored background, using a dynamic threshold, to obtain a binary matrix. The average threshold used in this operation can be tuned by the operator using a simple slider. It is not necessary to readjust this sensitivity if the ambient light does not change.
Finally, we apply an algorithm for removing small unconnected areas from the matrix, usually generated by image noise. The final binary image is then ready to be processed by the gesture tracking algorithm.
Figure 4: Example of image processing and its rendering.
The whole algorithm, executed for each incoming frame, can be described with this pseudocode:

    START: F0 = <grabbed frame>
      F1 = GRAY_IMAGE(F0)           // convert to grayscale
      F2 = BLUR_FILTER(F1)          // Gaussian smoothing
      if (AREA_MODE)  F3 = F2
      if (EDGE_MODE)  F3 = EDGE_FILTER(F2)
      if (STORE_AS_BACKGROUND)  B = F3
      F4 = F3 - B                   // background subtraction
      F5 = BINARIZE(F4)             // dynamic threshold
      F6 = ERODE(F5)                // remove small noisy areas
      TRACK_MOVEMENT(F6)
    goto START
The frame resolution is 320x240, and a full frame
rate (e.g. 25 FPS) can be achieved because all the
image filters are executed by the GPU.
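Although the actual implementation runs these filters on the GPU via Core Image, the same per-frame pipeline can be sketched, for illustration only, in C++ using OpenCV as a stand-in (the threshold and kernel sizes below are illustrative values, not the system's):

    #include <opencv2/opencv.hpp>

    // Sketch of the per-frame pipeline using OpenCV as a stand-in for the
    // Core Image filters described above. Parameter values are illustrative.
    cv::Mat processFrame(const cv::Mat& frameBGR, const cv::Mat& background,
                         bool edgeMode, double threshold) {
        cv::Mat gray, blurred, f3, diff, binary, eroded;
        cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);      // GRAY_IMAGE
        cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 0);    // BLUR_FILTER
        if (edgeMode)
            cv::Canny(blurred, f3, 50, 150);                   // EDGE_FILTER
        else
            f3 = blurred;                                      // AREA_MODE
        cv::absdiff(f3, background, diff);                     // F4 = F3 - B
        cv::threshold(diff, binary, threshold, 255,
                      cv::THRESH_BINARY);                      // BINARIZE
        cv::erode(binary, eroded, cv::Mat(), cv::Point(-1, -1),
                  2);                                          // ERODE
        return eroded;   // binary mask passed to the gesture tracker
    }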
3.3 Gesture Tracking Algorithm
Starting from the binary raster matrix we apply an
algorithm to detect a set of gesture parameters. This
heuristic algorithm supposes that the segmented
image obtained by the imagine elaboration process is
a human figure, and tries to extrapolates some
features from it. This process is based on a
simplified model of the human figure (see Figure 4
bottom). Additional model for single parts of the
body (face, hands) are under development and it can
be used for more “zoomed” version of the system.
At the moment we can rapidly detect the position of
the head, the arms, the legs, and its evolution over
time. Starting from these five time dependent
positions we decided to compute the following
parameters:
Right Arm angle
Left Arm angle
Right Leg angle
Left Leg angle
Torso angle
Right Leg speed
Left Leg speed
Barycentre X
Barycentre Y
Distance of subject from camera
Their names are self-explanatory. The distance from the camera is actually an index related to the real distance: it is computed as the ratio between the frame height and the maximum height of the detected figure. The leg speed is computed by analyzing the last couple of received frames; it is useful for triggering sounds with "kick-like" movements. We also compute these two additional parameters:
Global activity
Crest factor
The first is an indicator of the overall quantity of movement (0.0 if the subject is standing still, with no movements), while the second one is an indication of the concavity of the posture (0.0 means that the subject is standing with the legs together and the arms held against the body). Some optimizations are performed in order to start the frame analysis from an area centred on the last detected barycentre.
Rather than aiming at the design of a very sophisticated detection algorithm, we tried to implement it in a very optimized way, so as to maintain the target frame rate (25 FPS) and avoid latency between gestures and sounds.
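As an illustration of the kind of computation involved, the following sketch derives two of the simplest parameters (the barycentre and the distance index) from the binary mask. It is our reconstruction based on the description above, not the system's actual code:

    #include <cstdint>
    #include <vector>

    // Illustrative reconstruction: barycentre and distance index computed
    // from the binary mask, following the description in Section 3.3.
    struct MaskStats {
        float barycentreX, barycentreY;  // normalized to [0, 1]
        float distanceIndex;             // frame height / figure height
    };

    MaskStats analyzeMask(const std::vector<uint8_t>& mask, int w, int h) {
        long sumX = 0, sumY = 0, count = 0;
        int minY = h, maxY = -1;
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                if (mask[y * w + x]) {       // foreground pixel
                    sumX += x; sumY += y; ++count;
                    if (y < minY) minY = y;
                    if (y > maxY) maxY = y;
                }
            }
        }
        MaskStats s{0.f, 0.f, 0.f};
        if (count == 0) return s;            // no figure detected
        s.barycentreX = float(sumX) / count / w;
        s.barycentreY = float(sumY) / count / h;
        int figureHeight = maxY - minY + 1;  // vertical extent of the figure
        s.distanceIndex = float(h) / figureHeight;  // grows as subject recedes
        return s;
    }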
3.4 Sound Generation
The sound generation is based on the Mac OS Core Audio library. We used the Audio Unit API to build an audio graph: four instances of the DLS (DownLoadable Sounds) synthesizer are mixed together into the final musical signal. These synthesizers produce sounds according to standard MIDI messages received from their virtual input ports. We added two digital effects (echo and reverberation) to the final mix: for each synthesizer we can control the portion of its signal to be sent to these effects.
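The following minimal sketch shows how such an audio graph might be set up with the Audio Unit API. It is reduced to a single DLS synthesizer wired directly to the default output (the real system mixes four synthesizers plus the two effects), and error handling is omitted:

    #include <AudioToolbox/AudioToolbox.h>
    #include <unistd.h>

    // Minimal sketch: one DLS synthesizer connected to the default output.
    // Compile on Mac OS X with -framework AudioToolbox. Error checks omitted.
    int main() {
        AUGraph graph;
        NewAUGraph(&graph);

        AudioComponentDescription synthDesc = {
            kAudioUnitType_MusicDevice, kAudioUnitSubType_DLSSynth,
            kAudioUnitManufacturer_Apple, 0, 0 };
        AudioComponentDescription outDesc = {
            kAudioUnitType_Output, kAudioUnitSubType_DefaultOutput,
            kAudioUnitManufacturer_Apple, 0, 0 };

        AUNode synthNode, outNode;
        AUGraphAddNode(graph, &synthDesc, &synthNode);
        AUGraphAddNode(graph, &outDesc, &outNode);
        AUGraphConnectNodeInput(graph, synthNode, 0, outNode, 0);

        AUGraphOpen(graph);
        AUGraphInitialize(graph);
        AUGraphStart(graph);

        AudioUnit synthUnit;
        AUGraphNodeInfo(graph, synthNode, NULL, &synthUnit);

        // Send a MIDI note-on (middle C, velocity 100) to the synthesizer.
        MusicDeviceMIDIEvent(synthUnit, 0x90, 60, 100, 0);
        sleep(2);                                          // let the note sound
        MusicDeviceMIDIEvent(synthUnit, 0x80, 60, 0, 0);   // note-off

        AUGraphStop(graph);
        DisposeAUGraph(graph);
        return 0;
    }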
Each synthesizer module can load a bank of sounds (in the DLS or SF2 standard format) from the set installed in the system. The user can of course add his own sound banks, including sounds he has created, to the system. It is also possible to specify a background audio file, to be played together with the controlled sounds.
3.5 Mapper
The Mapper module translates the detected features into MIDI commands for the musical synthesizers. Each synthesizer works independently, and for each of them it is possible to select the instrument and the instrument bank.
Each parameter of the sounds (pitch, volume, etc.) can be easily linked to the detected gesture parameters using the GUI. For example, we can link the global activity to the pitch: the faster the subject moves, the higher the pitch of the notes played. The synthesized MIDI notes are chosen from a user-selectable scale: there is a large variety of them, ranging from the simplest ones (e.g. major and minor) to more exotic ones. As an alternative, it is possible to select continuous pitch instead of discrete notes: in this way the linked detected feature controls the pitch in a "glissando" way.
Sounds can also be triggered in a "drum mode": a MIDI note is played when the linked parameter reaches a selected threshold.
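A sketch of how such a mapping could quantize a normalized gesture parameter onto a musical scale, together with the threshold trigger of the drum mode, is shown below; the scale tables and note range are illustrative, not the system's actual presets:

    #include <cstdint>
    #include <vector>

    // Illustrative sketch of the Mapper's scale quantization: a normalized
    // gesture parameter in [0, 1] is mapped to a MIDI note on a scale.
    const std::vector<int> MAJOR = {0, 2, 4, 5, 7, 9, 11};  // semitone offsets
    const std::vector<int> MINOR = {0, 2, 3, 5, 7, 8, 10};

    uint8_t parameterToNote(float value,                // parameter in [0, 1]
                            const std::vector<int>& scale,
                            int rootNote = 48,          // C3, illustrative
                            int octaves = 3) {
        if (value < 0.f) value = 0.f;
        if (value > 1.f) value = 1.f;
        int steps = int(scale.size()) * octaves;        // total scale degrees
        int degree = int(value * (steps - 1) + 0.5f);   // round to nearest
        int octave = degree / int(scale.size());
        int offset = scale[degree % scale.size()];
        return uint8_t(rootNote + 12 * octave + offset); // MIDI note number
    }

    // "Drum mode": trigger a note when the parameter crosses the threshold.
    bool drumTrigger(float value, float previous, float threshold) {
        return previous < threshold && value >= threshold;
    }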
All these link settings can be stored in presets, easily recallable by the operators.
3.6 Parameters Summary
The detected parameters are shown in real time with a set of horizontal bars. Their displayed value is normalized between 0 and 1; we found that in this way it is easy to understand their role in a link.
At the end of a session it is possible, by pressing the Statistics button, to show simple statistics of the gesture parameters (currently the average and the variance). These data, together with some other useful information, can be saved in a text file for further external analysis.
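One way the average and variance can be accumulated incrementally during a session is Welford's online algorithm; the following is an illustrative sketch, not necessarily the computation used in the system:

    // Sketch of incremental average/variance accumulation for a gesture
    // parameter (Welford's online algorithm); one instance per parameter.
    struct RunningStats {
        long   n = 0;
        double mean = 0.0;
        double m2 = 0.0;          // sum of squared deviations from the mean

        void add(double x) {
            ++n;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        double average()  const { return mean; }
        double variance() const { return n > 1 ? m2 / n : 0.0; }
    };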
4 RESULTS AND DEVELOPMENTS
The installed system has been applied in case-study tests on several young patients affected by Autism Spectrum Disorder.
Autism is a brain development disorder characterized by impaired social interaction and communication. It appears in the first years of life and arrests the development of affective evolution. It basically compromises social interaction and language expression, and often leads to restricted and repetitive behaviour. Various studies report on the incidence of autism; they all confirm a large increase in recent years, with a 78% rise since 2002. On average, depending on the age of the data retrieved, autism affects about 1 child in 100 (Baio, 2012) at the peak age (8 years old). Figure 5 reports this rapid increase and trend (Chiarotti and Venerosi Pesciolini, 2012).
Autism has a genetic basis, but a complete
explanation of its causes is still unknown. An
exhaustive description of this disorder in medical
terms is beyond the scope of this document.
Studies have shown that music-therapy has a
significant, positive influence when used to treat
autistic individuals (Alvin and Warwick, 1992).
HEALTHINF2014-InternationalConferenceonHealthInformatics
566
Figure 5: Autism diagnosis rising.
Participating in music therapy gives autistic subjects the opportunity to experience non-threatening outside stimulation, as they do not engage in direct human contact. Music is a more universal language than oral language, and it allows a more instinctive form of communication.
Classic music-therapy is mostly based on listening to sounds and music together with a therapist (passive music-therapy). Our system, instead, is active and interactive: by providing an augmented interaction with the environment, it tries to draw the subject out of his pathological isolation.
The experimentation of the system with real cases was performed over the last year and a half on 10 patients.
The therapists reported a positive outcome, confirmed also by the parents of the young patients involved. In particular there is a continuous and positive trend, evolving from the first meetings to the latest. One of the most evident positive changes is the appearance of eye contact with the therapist, coupled with smiling and vocalization to imitate the sounds or the voice of the therapist. In some subjects who previously lacked it, a will to communicate verbally appeared.
Moreover, there have also been attempts at imitative actions and free initiatives by the patients upon input from the therapists.
The positive outcome of the therapies is particularly evident, among other things, in the following signs of evolution:
Some semblance of play with the operator, with signs of understanding such as head gestures inviting the operator to repeat certain movements;
Imitative actions and/or free initiatives by the children upon operator input;
Interest in the acoustic stimuli, with attempts to adapt their movements and play actions.
In general, the experimentation has proved particularly promising with respect to a very important and challenging goal: verifying that the improvements obtained within the case-study setting are preserved in the external environment as well, transferring the motivation and curiosity for full communicative interaction to the real world.
Regarding the future development of the system, there is an active plan to add a database to the control software in order to store the patients' data, including all the parameter statistics for the music-therapy sessions. In this way the therapists would like to investigate the relationship between a patient's gesture evolution and his autistic disorders.
We are developing new models for the gesture recognition module, specialized for single parts of the body. For example, we would like to offer the possibility of concentrating the video camera only on the patient's face, detecting movements and positions of the eyes and mouth.
A 3D version of the system is also under study; with it, the subject would no longer need to stand facing the camera, but could rotate around the body axis while the system still captures the correct arm and leg angles.
5 CONCLUSIONS
We described an interactive, computer-based system, based on real-time image processing, which reacts to the movements of a human body by playing sounds. The mapping between body motion and produced sounds is easily customizable through a software interface. This system has been used to test an innovative music-therapy technique for treating autistic children. The experimentation of the system with real cases, performed over the recent period, confirmed several benefits of the proposed system. These have been confirmed both by the therapists and by the parents of the young patients. The most interesting outcome of the testing was the improvement obtained, which could also lead to a promising transfer of the attitudes displayed in the case study to the external environment, in particular with reference to the motivation and curiosity for full communicative interaction in the real world.
Moreover, during the case study testing, it was found that, along with treating autism spectrum disorder, the system could be successfully used for other diseases, such as Alzheimer's disease and other pathologies typical of older people.
ACKNOWLEDGEMENTS
The activity which led to the implemented system was partially sponsored by the Foundation Cassa di Risparmio di Lucca. The authors would like to thank Dr. Elisa Rossi for the valuable contribution given in the test case and in the evaluation of the results.
REFERENCES
Baio, J., 2012. Prevalence of Autism Spectrum Disorders
– Autism and Developmental Disabilities Monitoring
Network, 14 sites, United States, 2008. In Morbidity
and Mortality Weekly Report (MMWR) Surveillance
Summaries, Centers for Disease Control and
Prevention, U.S. Department of Health and Human
Services, March 30, 2012, pp. 1-19, n. 61 (SS03).
Chiarotti, F., Venerosi Pesciolini, A., 2012. Epidemiologia
dell’autismo: un’analisi critica. In Congresso
Nazionale e Workshop formativi – Autismo e percorsi
di vita: il ruolo della rete nei servizi, Azienda ASL di
Ravenna, October 4-6, 2012, Ravenna, Italy.
Kozima, H., Nakagawa, C., Yasuda, Y., 2005. Interactive
robots for communication-care: a case-study in autism
therapy. In International IEEE Workshop on Robot
and Human Interactive Communication.
Ould Mohamed, A., Courbulay, V., 2006. Attention
analysis in interactive software for children with
autism. In Proceedings of the 8th international ACM
SIGACCESS conference on Computers and
accessibility. Portland, Oregon, USA.
Riva, D., Bulgheroni, S., Zappella, M., 2013.
Neurobiology, Diagnosis & Treatment in Autism: An
Update, John Libbey Eurotext.
Swingler, T., Price, A., 2006. The Soundbeam Project.
Alvin, J., Warwick, A., 1992. Music therapy for the
autistic child, Oxford University Press, USA.
Villafuerte, L., Markova, M., Jorda, S., 2012. Acquisition
of social abilities through musical tangible user
interface: children with autism spectrum condition and
the reactable. In Proceedings of CHI EA '12-CHI '12
Extended Abstracts on Human Factors in Computing
Systems, pp. 745-760, ACM New York, NY, USA.
HEALTHINF2014-InternationalConferenceonHealthInformatics
568