Behavioral Analysis Through Computer Vision: Detecting Emotions and
Hand Movements to Aid Mental Health
Aditya Gupta and Geetanjali Bhola
Department of Information Technology, Delhi Technological University, Rohini, Delhi, India
Keywords:
CNN, VGG-Model, Transformers, DeepFace, MediaPipe, RAF-DB.
Abstract:
This paper presents a behavioral pattern analysis methodology developed for a mental health diagnosis task, using a CNN with an additional attention layer to form an enhanced VGG model. To improve overall performance, a Transformer has been integrated into the CNN. The VGG-Transformer model is fine-tuned and trained on over 3,500 images from the RAF-DB dataset for optimal feature extraction and classification. The application model uses the DeepFace library, which is built around a VGG model, together with the MediaPipe library to process video input from the user. The system accurately recognises various emotions and hand movements from frames, reaching a final training accuracy of 94.33% and a validation accuracy of 95.76%. The DeepFace integration supports detailed and precise recognition of facial emotions, while MediaPipe hand tracking enables in-depth hand movement analysis. This complementary approach allows behavior to be understood in depth, and the results can aid in detecting possible mental health problems and support therapists in diagnosis and treatment. This work demonstrates the need for state-of-the-art machine learning in mental health assessment and shows its effectiveness by combining the enhanced VGG16 model with Transformers and the specialised DeepFace and MediaPipe libraries.
1 INTRODUCTION
Mental health disorders are increasingly becoming a
global concern that has widely affected the quality of
life of individuals and further resulted in serious out-
comes, even mortality. The World Health Organiza-
tion claims that about 450 million people across the
world are affected by different mental health condi-
tions, which makes mental disorders one of the lead-
ing causes of ill health and disability in the world
(Gleason, 2023). Furthermore, more than 264 million people are affected by various mental illnesses, which are a major contributor to the global burden of disease (Islam et al., 2024). Suicide tragically claims almost 700,000 lives every year and is the second leading cause of death among people aged 15-29 years (Goueslard et al., 2024).
Emotional health is an integral part of mental wellness and is significantly determined by social interaction and relationships. Lockdown and social distancing during the COVID-19 pandemic severed those connections and increased isolation, thereby increasing anxiety, stress, and depression among adolescents. Social support systems weaken alongside reduced emotional health, making young people more at risk of developing mental health issues and risky behaviors such as substance use. Social relationships with family, friends, or trusted adults provide a sense of belonging and stress-buffering benefits that foster emotional resilience and prevent loneliness (Kwaning et al., 2023).
Disorders such as Autism Spectrum Disorder (ASD) greatly affect early development and present with challenges in social and communicative behaviors together with restricted, repetitive patterns of behavior, interests, or activities. Since children with ASD have serious impairments in social integration as well as emotional regulation, the goal of treatment is to strengthen these areas from an early age. In comparison with typically developing children, children with ASD are characterized above all by motor and gestural communication deficits: involuntary movements and stereotyped gestures. Detecting unusual gestures through hand recognition could therefore become a significant early indicator for ASD, as it would help reveal the underlying social and motor impairments and support focused interventions to improve social
engagement either at home or in school (Quintar et al., 2025).
These are the issues our technology aims to tackle, taking anxiety, nervousness, depression, involuntary movements, stress, and related states as the mental health classes considered, keeping in mind that emotional health is a major factor when investigating mental health problems in young people. The young population experiences considerable stress, is constantly occupied with work, and often does not find time to check on their emotional health. Providing a method to detect an individual's emotional health through human-computer interaction, based on verbal (speech) and non-verbal (face, posture, text, etc.) cues, therefore helps build the future of the health sector in the face of the growing global concern about mental health, battling stigma and encouraging people to take care of their emotional and mental well-being (Mishra et al., 2023).
A hybrid architecture for the mental disorder detection task uses YOLOv8 to detect visual cues related to mental disorders and combines Convolutional Neural Networks (Miao, 2021) with Vision Transformer models to form an ensemble classifier. This achieves an overall accuracy of about 81% in predicting mental illnesses such as depression and anxiety. The approach also enhances transparency and interpretability: critical regions of the input image that influence predictions can be highlighted through Gradient-weighted Class Activation Mapping and saliency maps. Making the system's decisions easier to understand increases health professionals' confidence in the results and thus supports a more informed diagnosis process (Aina et al., 2024). Over the years, neural networks have improved with the availability of large annotated datasets, and they continue to drive the development of complex models capable of real-time analysis of dynamic behaviors. The VGG model was
proposed by Simonyan and Zisserman, 2014, which
was an important milestone for image classification
tasks, considering that it had deep architecture and
utilised small convolutional filters. The VGG archi-
tecture is employed to train the dataset, specifically
using the VGG-16 CNN model developed by K. Si-
monyan and A. Zisserman. This model, renowned for
its performance, was one of the top submissions to
ILSVRC-2014, achieving a top-5 accuracy of 92.7%
on the ImageNet dataset (Kusumawati et al., 2022).
The robustness of the VGG16 architecture (Kusumawati et al., 2022) is evident in its 96% accuracy for detecting facial expressions on the CK+ dataset, trained and validated alongside algorithms such as Linear Regression, K-Nearest Neighbour (KNN), and Random Forest (Zhang, 2020). Owing to the simplicity of its structure and its powerful feature extraction, the VGG model has been applied to many computer vision tasks.
Although initially designed for NLP, transform-
ers have recently been adapted for vision tasks, re-
sulting in variants such as the Vision Transformers
(ViTs) (Berroukham et al., 2023). Because of their
high affinity for capturing long-range dependencies
and contextual information, they are very fitting for
analysing complicated patterns in video data. Blending the advantages of VGG models with transformers therefore combines the strengths of both architectures, improving feature extraction and classification accuracy.
Recent advances in machine learning and AI have
opened new opportunities for very early detection and
intervention of mental health issues. Key areas of
interest are emotion analysis and the study of body
movement. Researchers, with advanced models like
VGG-Transformer architecture, can now look into
minute details in facial expressions and physical ges-
tures that may indicate some sort of mental illness.
2 RELATED WORKS
Research on detecting stereotypical movements (SMs) in children with autism, such as arm flapping, spinning, and body rocking, consistently underlines the demand for early intervention. Traditional methods for detecting stereotypical movements have relied on manual observation, which is time-consuming and labor-intensive, and this has led to the development of intelligent diagnostic tools. The Stereotypic Movement Detection and Analysis (SMDA) software is one such tool. SMDA performs video pre-processing through computer vision and uses a machine learning framework in conjunction with the MediaPipe library (MPL) (Ma et al., 2022). This enables the software to process recorded videos, label body part landmarks (LBPs) such as the wrists, identify their movement, and extract their trajectories in X-Y coordinates during SMs. A Data Peak Filtering Algorithm is used in the software to estimate the intensities and frequencies of the SMs. Given SMDA's fast detection capabilities and flexibility in the choice of LBPs, it is an effective diagnostic tool for therapists, enhancing the efficiency and accuracy of SM detection and thus improving treatment outcomes in children with autism (Reddy et al., 2023).
Previous AEE research has investigated the use
of various sensor types in marketing, technical systems, and human-robot interaction. These studies have carefully classified and analyzed the sensors applied in emotion detection (contactless methods, contact-based sensors, and skin-penetrating electrodes) and discussed in depth their effectiveness in identifying and measuring the type and intensity of an emotional state. This classification of sensors provides important guidance for researchers in choosing the best methodology, or exploring alternatives, for emotion analysis according to the application scope, the expected results, and the sensors' inherent limitations. The studies have also suggested practical uses of this idea for advancing the human-centered aspects of the IoT and related affective computing frameworks (Dzedzickis et al., 2020).
While facial data were previously used for identity verification, the characteristics of an eye blink have also been utilised for behavior monitoring. Notably, a smart behavioral biometrics system was developed using the LeNet, AlexNet, and VGGNet architectures, and a comparative analysis was carried out across these networks to evaluate their effectiveness in continuous authentication (Reshma and Jose Anand, 2023).
Many studies have employed variable-centered
methods to explore the relationship between psychi-
atric disorders and internet addiction (IA), using tech-
niques like regression analysis, factor analysis, and
structural equation modeling. These approaches fo-
cus on the correlations between variables and typi-
cally assume a homogeneous sample, grouping indi-
viduals based on total score cut points, often over-
looking individual response patterns. In contrast,
person-centered methods, such as latent class analy-
sis (LCA), classify individuals into subgroups based
on their unique response patterns, offering more ac-
curate and insightful classification. LCA has been
widely used in studying comorbid symptoms of de-
pression, anxiety, and IA, providing a more nuanced
understanding of the interrelationships among these
factors. Despite the growing research on these co-
morbidities, there remains a gap in exploring indi-
vidual differences in emotional patterns and IA using
person-centered approaches (Beneytez, 2023).
Restricted and repetitive patterns (RRP) involve
behaviors such as insistence on sameness, repetitive
sensory-motor actions, and sensory processing dif-
ferences, often emerging as self-regulation strategies
to reduce anxiety. These behaviors help individu-
als maintain control in unpredictable environments,
providing a mechanism for managing overstimula-
tion or increasing arousal in low-stimulation states.
RRP patterns are especially prominent in individuals
with autism spectrum disorder (ASD), where sensory
over-responsiveness correlates with heightened anxi-
ety, limited social adaptability, and hyper-focused at-
tention, while sensory under-responsiveness does not
typically link to anxiety (Gao et al., 2022). These
patterns are crucial for identifying early signs of
emotional dysregulation and mental health issues, as
they often signal underlying anxiety. By analyzing
emotional fluctuations, tools like DeepFace can track
these patterns, providing a real-time understanding of
emotional states and their relationship to IA. Deep-
Face’s ability to detect emotional expressions and be-
havioral patterns enhances the detection of early signs
of comorbidity, guiding targeted interventions. In
combination with systems like MediaPipe, which can
identify unusual behaviors in real-time, these tools of-
fer a comprehensive approach to monitoring and sup-
porting mental health, facilitating early detection and
more effective prevention strategies.
Major factors in a healthy lifestyle among students, compiled from the experimental results, include, in descending order: freedom from drug addiction (24.3%), sports activity (15.7%), abstaining from alcohol and smoking (11.4%), responsible sex life (9.8%), and nutrition
(7.4%). Other features identified include having a
meaningful life, good self-attitude, self-development,
and family relationships, among others. However,
they ranked lower compared to psychological health,
which implies that the students acknowledge the role
of psychological health in a holistic way of living.
This view aligns with new perspectives on health, which emphasize the balance between physical well-being, internal harmony, and environmental congruence (Yunusovich et al., 2022).
From the psycho-emotional health point of view, the level of emotional instability among students varies: 20.5% of students manifest a high level, 62.4% an intermediate level, and 17.1% a low level of instability according to Eysenck's personality inventory. No notable difference in distribution was seen between the experimental and control groups, which were comparable across all levels of emotional stability. This signals that emotional regulation and resilience must be included among the general health components of education.
The system caters to the need for a user-friendly
solution in extracting emotion-related information
from facial expressions in real-time video streams.
This system processes videos uploaded by users,
detects facial emotions at each timestamp using
OpenCV, DeepFace (Srisuk and Ongkittikul, 2017),
and Streamlit, and then generates output annotated
with these emotions. OpenCV is responsible for pro-
cessing the videos, DeepFace analyzes emotions, and
Streamlit provides an interactive user interface. It
effectively identifies dominant emotions, quantifies
their occurrences, and gives results. This has im-
proved human-computer interaction and contributed
much to the research in mental health (Bhanupriya
et al., 2023). DeepFace reaches an accuracy of 98.61% on the LFW database while analysing the facial depth of the database samples (Srisuk and Ongkittikul, 2017).
The proposed algorithm addresses the challenge
of mental health by estimating addiction levels based
on three critical factors: depression, social anxiety,
and loneliness. This research focused on the anal-
ysis of these factors in a bid to gain some insight
into addictive behaviors. The study applied three
well-established machine learning algorithms: Logis-
tic Regression, Ridge Regression, and Support Vec-
tor Machine to assess the levels of addiction. Among these algorithms, the Support Vector Machine gave the best performance, at 91.68%. Addressing these factors of mental health
and measures for controlled engagement in mobile
gaming are some of the steps necessary to ensure im-
proved mental health among students (Chauhan et al.,
2023). Stress and negative emotions play a major part in emotional and mental imbalance, and since each person can express facial emotions differently, what we perceive may not reflect the actual emotion, leading to errors in judgement. Hence, one emotion recognition system based on physiological signals designed a specific emotion induction experiment to collect five physiological signals from subjects, including electrocardiogram, galvanic skin response (GSR), blood volume pulse, and pulse. Run with the help of a Support Vector Machine to study the trend of negative emotions captured by the model, the system achieved an accuracy of 89.1% (Chang et al., 2013).
The hand tracking module outputs an array of size
21, where each value indicates whether a joint on the
left hand is visible. If all joints of a particular fin-
ger are visible, that finger is considered open, and
its count is added to the total number of open fin-
gers. The total number of open fingers represents
the runs scored by the batsman. The trained convolu-
tional neural network achieved an accuracy of 0.9767
after ten epochs of training. Since MediaPipe is a ready-made framework, its landmark detection is expected to be precise (Teja Gontumukkala et al., 2022).
3 PROPOSED MODEL
3.1 Dataset Preprocessing
The model first builds a preprocessing pipeline for an image classification model using TensorFlow and Keras (Bajpai and He, 2020). It uses the ImageDataGenerator class to apply transformations to both the training and test datasets that enhance the model's performance. For the training set, images are rescaled by a factor of 1./255 to normalize pixel values to the range [0, 1]. Data augmentation is then introduced through a zoom range of 0.3 and random horizontal flips, which increases the variability of the training images and improves generalization to the unseen test dataset. The RAF-DB dataset was chosen for its nuanced portrayal of emotions (Achlioptas et al., 2023), with colour and depth that make it "stand out from the crowd" (Weng et al., 2023), and it is valuable for training our model to the highest possible accuracy. Finally, images are fed into the preprocessing function keras.applications.vgg16.preprocess_input() so that they match the input expected by the VGG16 model. In practice this means processing the RAF-DB training set (Galea and Seychell, 2022) in batches of 64 images, resizing them to 100x100 pixels, and one-hot encoding the labels. For the test set, the same preprocessing is applied without augmentation to keep it consistent with the training data. This approach gives efficient data handling, consistent preprocessing, and a robust basis for training and evaluation.
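A minimal sketch of this preprocessing pipeline is given below, assuming a local copy of RAF-DB arranged with one subdirectory per emotion class under the hypothetical paths raf_db/train and raf_db/test; it is not the authors' exact code.

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.vgg16 import preprocess_input

train_dir = "raf_db/train"   # hypothetical dataset paths
test_dir = "raf_db/test"

# Training generator: rescaling, zoom augmentation, and random horizontal flips,
# plus the VGG16 preprocessing function, as described above.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    zoom_range=0.3,
    horizontal_flip=True,
    preprocessing_function=preprocess_input,
)
# Test generator: same preprocessing without augmentation, for consistency.
test_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    preprocessing_function=preprocess_input,
)

train_gen = train_datagen.flow_from_directory(
    train_dir, target_size=(100, 100), batch_size=64, class_mode="categorical")
test_gen = test_datagen.flow_from_directory(
    test_dir, target_size=(100, 100), batch_size=64, class_mode="categorical",
    shuffle=False)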
3.2 Network Architecture
This network architecture is a sophisticated image classification model that combines a custom TransformerBlock with a pre-trained VGG16 architecture. In this study, the TransformerBlock (Zhang et al., 2024) is designed with an embedding dimension of 512, 4 attention heads, a feed-forward network dimension of 512, and a dropout rate of 0.1. Multi-head attention is combined with a feed-forward network that includes ReLU activation, Layer Normalization, and Dropout layers to ensure robust processing of features (Yan et al., 2023). In the proposed model, the pre-trained VGG16, a strong feature extractor, is used without its top layers, and its weights are kept non-trainable to preserve the learned representations (Yamsani et al., 2023).
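A minimal sketch of such a TransformerBlock as a custom Keras layer is shown below, using the hyperparameters stated above; the exact residual and normalization ordering is an assumption rather than the authors' exact code.

import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    """Multi-head self-attention followed by a ReLU feed-forward network,
    each with a residual connection, LayerNormalization, and Dropout."""
    def __init__(self, embed_dim=512, num_heads=4, ff_dim=512, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.drop1 = layers.Dropout(rate)
        self.drop2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        attn_out = self.att(inputs, inputs)                 # self-attention over the sequence
        attn_out = self.drop1(attn_out, training=training)
        out1 = self.norm1(inputs + attn_out)                # residual + layer norm
        ffn_out = self.ffn(out1)
        ffn_out = self.drop2(ffn_out, training=training)
        return self.norm2(out1 + ffn_out)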
Figure 1: Phase 1 of Network Architecture
Figure 2: Phase 2.1 of Network Architecture
The output from the VGG16 model is flattened and fed into a dense layer of 512 units with ReLU activation. The reshaped output is then passed to the TransformerBlock, where the attention mechanism further enriches the feature representation. Another multi-head attention layer with 4 heads and a key dimension of 512 refines the features further, adding extra depth and complexity to the model's feature learning.
These resultant features undergo global average
pooling to produce a vector of fixed size, encapsu-
lating all the essential information. This is then fed
through two fully connected layers, each with 4096
units and ReLU activation, interspersed with dropout
layers at a rate of 0.5 to avoid overfitting. Class proba-
bilities corresponding to the 7 target classes are given
by the last layer, a dense layer with 7 units and soft-
max activation (Huang et al., 2024).
The model is trained using the Adam optimizer with a learning rate of 0.0001, with categorical cross-entropy as the loss function and accuracy as the metric (Şen and Özkurt, 2020). This approach has proven very successful in combining the pre-trained feature extraction capabilities of VGG16 with the powerful contextual learning of Transformers and their attention mechanisms. The additional attention layer further enhances the model's ability to capture fine-grained patterns in the data, making it a powerful tool for classifying images across diverse application domains (Yuan et al., 2023).
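The overall model might be assembled and compiled roughly as follows, reusing the TransformerBlock sketched above; treating the 512-dimensional dense output as a length-1 sequence before the attention layers is an assumption about the implementation, not the authors' published code.

from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(100, 100, 3))
base.trainable = False  # keep the pre-trained representations fixed

inputs = layers.Input(shape=(100, 100, 3))
x = base(inputs)
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Reshape((1, 512))(x)                               # assumed: length-1 sequence
x = TransformerBlock(embed_dim=512, num_heads=4, ff_dim=512, rate=0.1)(x)
x = layers.MultiHeadAttention(num_heads=4, key_dim=512)(x, x)  # extra attention layer
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(7, activation="softmax")(x)             # 7 emotion classes

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])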
Figure 3: Phase 2.2 of Network Architecture
Figure 4: Phase 2.3 of Network Architecture
Figure 5: Phase 2.4 of Network Architecture
3.3 Training and Validation
The model is trained and validated using a comprehensive and robust training pipeline for image classification in TensorFlow and Keras. The pipeline uses several callbacks to increase training efficiency and model performance. The ModelCheckpoint callback saves the model weights that reach the lowest validation loss so that the best model version is never lost. The EarlyStopping callback stops training when the validation loss has not improved for 3 consecutive epochs and restores the best weights to avoid overfitting. In addition, the ReduceLROnPlateau callback reduces the learning rate by a factor of 0.2 when the validation loss has not improved for 6 epochs, which helps fine-tune the model. Real-time training progress and detailed logs are managed by the TensorBoard and CSVLogger callbacks. The model is trained for 50 epochs, with the steps per epoch and validation steps determined by the batch size of 64. The training and test sets are preprocessed using the VGG16 preprocessing function to ensure a consistent input format. This yields a training accuracy of 94.33% and a validation accuracy of 95.76%, demonstrating the effectiveness of the callback strategy within the overall training methodology.
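A minimal sketch of this callback configuration and training call, assuming the generators from Section 3.1 and the compiled model from Section 3.2; file names are placeholders.

from tensorflow.keras.callbacks import (ModelCheckpoint, EarlyStopping,
                                        ReduceLROnPlateau, TensorBoard, CSVLogger)

callbacks = [
    ModelCheckpoint("best_model.weights.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=True),
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=6),
    TensorBoard(log_dir="logs"),
    CSVLogger("training_log.csv"),
]

history = model.fit(
    train_gen,
    steps_per_epoch=train_gen.samples // 64,
    validation_data=test_gen,
    validation_steps=test_gen.samples // 64,
    epochs=50,
    callbacks=callbacks,
)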
Figure 6: Phase 3 of Network Architecture
Figure 7: Phase 4 of Network Architecture
Figure 8: Phase 5 of Network Architecture
3.4 Performance Metrics
We build a holistic evaluation pipeline for the image classification model using TensorFlow and Keras, covering prediction, performance analysis, and visualization. The trained model (fernet) predicts class probabilities on the training set, which are then converted into class labels using np.argmax. The class indices from the data generator are then inverted to map indices back to class names so the results can be interpreted. scikit-learn is used to output a confusion matrix and a classification report that includes the relevant performance metrics. The emotion classes are numbered as follows: Surprise (1), Fear (2), Disgust (3), Happy (4), Sad (5), Angry (6), Neutral (7).
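A sketch of this evaluation step, assuming a trained model named fernet and a data generator created with shuffle=False so that its .classes attribute aligns with the prediction order; variable names are illustrative.

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Predict class probabilities and convert them to class labels.
probs = fernet.predict(train_gen)              # generator assumed created with shuffle=False
y_pred = np.argmax(probs, axis=1)
y_true = train_gen.classes

# Invert the class-index mapping so reports show emotion names instead of indices.
idx_to_class = {v: k for k, v in train_gen.class_indices.items()}
target_names = [idx_to_class[i] for i in range(len(idx_to_class))]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=target_names))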
Keywords and Metrics:
1. Precision:
\[ \text{Precision}_i = \frac{TP_i}{TP_i + FP_i} \]
where TP_i is the number of true positives for class i and FP_i is the number of false positives for class i.
2. Recall:
\[ \text{Recall}_i = \frac{TP_i}{TP_i + FN_i} \]
where TP_i is the number of true positives for class i and FN_i is the number of false negatives for class i.
3. F1-Score:
\[ \text{F1-Score}_i = 2 \times \frac{\text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \]
It is the harmonic mean of Precision and Recall.
4. Accuracy:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
where TP is the total number of true positives, TN the total number of true negatives, FP the total number of false positives, and FN the total number of false negatives.
Figure 9: RAF-DB dataset and emotions (Left to Right): disgust (3), sad (5), angry (6), surprise (1), neutral (7), happy (4), fear (2)
Table 1: Classification Report of Training Set
Class Precision Recall F1-Score Support
1 0.89 0.81 0.92 329
2 0.91 0.96 0.99 74
3 0.78 0.86 0.88 160
4 0.91 0.93 0.96 1185
5 0.94 0.95 0.97 478
6 0.88 0.90 0.96 162
7 0.95 0.97 0.98 680
Accuracy 0.94 3068
Macro avg 0.92 0.93 0.94 3068
Weighted avg 0.93 0.94 0.94 3068
5. Macro Average:
\[ \text{Macro Precision} = \frac{1}{n} \sum_{i=1}^{n} \text{Precision}_i \]
\[ \text{Macro Recall} = \frac{1}{n} \sum_{i=1}^{n} \text{Recall}_i \]
\[ \text{Macro F1-Score} = \frac{1}{n} \sum_{i=1}^{n} \text{F1-Score}_i \]
where n is the number of classes.
6. Weighted Average:
\[ \text{Weighted Precision} = \frac{\sum_{i=1}^{n} (\text{Precision}_i \times \text{Support}_i)}{\sum_{i=1}^{n} \text{Support}_i} \]
\[ \text{Weighted Recall} = \frac{\sum_{i=1}^{n} (\text{Recall}_i \times \text{Support}_i)}{\sum_{i=1}^{n} \text{Support}_i} \]
\[ \text{Weighted F1-Score} = \frac{\sum_{i=1}^{n} (\text{F1-Score}_i \times \text{Support}_i)}{\sum_{i=1}^{n} \text{Support}_i} \]
where Support_i is the number of true instances of class i.
Figure 10: Model Loss and Training Accuracy Graph
Table 2: Classification Report of Test Set
Class Precision Recall F1-Score Support
1 0.98 0.96 0.94 64
2 0.92 0.87 0.99 64
3 0.89 0.95 0.92 64
4 0.88 0.92 0.93 64
5 0.96 0.91 0.94 64
6 0.95 0.96 0.89 64
7 0.94 0.93 0.90 64
Accuracy 0.95 448
Macro avg 0.94 0.95 0.95 448
Weighted avg 0.97 0.96 0.95 448
3.5 Application Model
The application model presents a state-of-the-art methodology for analysing emotion and hand movement expression in video frames, using computer vision and deep learning technologies that bring together OpenCV (Bhanupriya et al., 2023), MediaPipe, and DeepFace (a weighted VGG model) (Bhanupriya et al., 2023)(Awana et al., 2023)(Firmansyah et al., 2023) for a complete analysis. Frames are extracted from a video at intervals of 30 frames, yielding a dataset of 1,500 frames for a typical 5-minute video at 30 fps. This ensures a constant temporal resolution for further analysis, and each frame is time-stamped to enable accurate temporal mapping.
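A minimal OpenCV sketch of this frame-sampling step, assuming a hypothetical input file video.mp4 and the 30-frame interval described above.

import cv2

video_path = "video.mp4"          # hypothetical input video
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
interval = 30                     # keep every 30th frame, as described above

frames, timestamps = [], []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % interval == 0:
        frames.append(frame)
        timestamps.append(index / fps)   # timestamp in seconds for temporal mapping
    index += 1
cap.release()
print(f"Extracted {len(frames)} frames")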
Hand landmark detection in the MediaPipe hand
model involves the identification of spatial coordi-
nates of 21 key points on each hand in every frame
(Madrid et al., 2022). The accuracy rate realized
in detecting hand gestures using this model is more
than 95%, which is essential in understanding non-
verbal communication cues (Singhal et al., 2023).
Concurrently, DeepFace analyzes facial expressions
in every frame and then infers a probability of sev-
eral emotional states, such as happiness, sadness,
anger, and surprise. The DeepFace emotion detec-
tion model is reported to have an accuracy of 98% for
the identification of basic emotions (Bhanupriya et al.,
2023)(Awana et al., 2023)(Firmansyah et al., 2023).
MediaPipe is an end-to-end, cross-platform framework for building multimedia ML pipelines, developed and open-sourced by Google Research and applied in impactful products such as Google Lens, ARCore, and Google Home. The framework can run machine learning inference efficiently, close to real time, on servers, on mobile devices (including Android and iOS), and on embedded systems such as Google Coral and the Raspberry Pi, supporting on-device inference with minimal latency (Fan, 2023).
As a framework for processing time-series data, MediaPipe is very effective at augmenting pre-trained models to detect hands, gestures, and movements with precision, making it well suited to detecting atypical behaviors or motor patterns. The MediaPipe hand tracking module is robust in identifying spatio-temporal landmarks across frames, allowing detailed analysis of involuntary or repetitive gestures. This capability significantly enhances detection and tracking in studies of behavioral and neurological disorders, such as ASD, providing reliable data points for assessing potential diagnostic indicators.
Figure 11: Flowchart of Application Model
The frames and their timestamps are then passed through a processing pipeline in which hand movements are detected from hand landmarks and classified; in a typical analysis, a movement is detected in about 5% of frames. DeepFace returns emotion probabilities for the facial expression in each frame. In a sample analysis, for example, happiness might be detected in 61% of frames, sadness in 12%, anger in 7%, neutrality in 2%, fear in 8%, disgust in 1%, and surprise in 4%.
This consolidated information can also be visual-
ized using Matplotlib. This will include bar charts of
the total scores for all emotions, as well as for the
number of times hand movements have occurred in
all frames in an analysis. For example, if the num-
ber of frames in an analysis is 1,500, then happiness
can contribute a total score of 600, sadness 375, anger
300, surprise 225, and hand movements are detected
in 900 frames. Line graphs will trace temporal evolu-
tion—that is, trends of emotion and hand movements
over time. The dual visualization will help understand
instant results and monitor changes across a video
timeline.
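A minimal Matplotlib sketch of this dual visualization, using the illustrative counts mentioned above (not measured results) and placeholder per-frame series.

import matplotlib.pyplot as plt

# Illustrative totals from the example above (1,500 analysed frames).
totals = {"happy": 600, "sad": 375, "angry": 300, "surprise": 225, "hand movement": 900}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart of cumulative emotion scores and hand-movement count.
ax1.bar(list(totals.keys()), list(totals.values()))
ax1.set_title("Cumulative scores")
ax1.tick_params(axis="x", rotation=45)

# Line graph of per-frame values over time (placeholder series of the right length).
happy_per_frame = [0.4] * 1500      # would come from DeepFace output per frame
hand_per_frame = [0, 1] * 750       # would come from MediaPipe detections per frame
ax2.plot(happy_per_frame, label="happy")
ax2.plot(hand_per_frame, label="hand movement", alpha=0.3)
ax2.set_title("Temporal evolution")
ax2.set_xlabel("frame")
ax2.legend()

plt.tight_layout()
plt.show()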
The ’DeepFace.analyze()’ function computes ’emotion scores’ by estimating the probability of each emotion type detected on a face. When it runs its emotion analysis, it relies on a pre-trained CNN model to produce a set of probabilities that sum to 1, where each probability represents the likelihood of an emotion such as happiness, sadness, or anger. For instance, if ’happy’ scores 0.78 and the other emotions have smaller values, ’happy’ is the most probable emotion in that frame. Such values allow for precise, frame-by-frame analysis of emotional expression.
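A sketch of obtaining per-frame emotion scores with DeepFace; depending on the DeepFace version, analyze() may return a dict or a list of dicts and report percentages rather than probabilities, so the handling below is an assumption to adapt.

from deepface import DeepFace

def emotion_scores(frame):
    """Return the emotion-score dictionary for a single video frame."""
    result = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
    if isinstance(result, list):      # newer DeepFace versions return a list of faces
        result = result[0]
    return result["emotion"]          # e.g. {"happy": 78.0, "sad": 3.1, ...}

# scores = emotion_scores(frames[0])
# dominant = max(scores, key=scores.get)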
These emotion scores help interpret ambiguous or subtle facial cues in a quantitative manner. Because ’DeepFace.analyze()’ captures the degree to which each emotion is present, systems can track changes in emotional intensity, a factor critical for user experience assessment, social robotics, and mental health monitoring.
The MediaPipe Hands module detects hand activity within a frame by locating landmarks that the system can use to track and measure gestures or repeated actions over time. Each frame is parsed to identify landmarks for any hands detected, and the positions of key points, including the finger joints and the wrist, are logged. When landmarks are detected, hand movement is recorded for that frame, producing a binary score (movement present or not) that reflects an occurrence of hand gestures.
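A sketch of this per-frame binary hand-movement check using the MediaPipe Hands solution; treating "landmarks detected" as "movement present" mirrors the description above and is a simplification.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def hand_movement_flags(frames):
    """Return a list of 0/1 flags: 1 if hand landmarks were detected in the frame."""
    flags = []
    with mp_hands.Hands(static_image_mode=True, max_num_hands=2,
                        min_detection_confidence=0.5) as hands:
        for frame in frames:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB input
            results = hands.process(rgb)
            flags.append(1 if results.multi_hand_landmarks else 0)
    return flags

# movement_count = sum(hand_movement_flags(frames))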
MediaPipe offers a basic scoring system for the
examination of gestures. It tracks hand movement oc-
currence and continuity between frames, which can
be very useful in studies of behavioral patterns or dis-
orders such as ASD; specific gestures or repetitive
movements are indicators of behavioral traits.
Detailed logging and error handling ensure that every frame is processed independently, making the system robust and reliable. Exceptions are handled gracefully to maintain the integrity of the analysis. The system produces
an aggregate view of the detected emotions and hand
movements—a feature of critical importance in appli-
cations such as mental health monitoring. The sys-
tem’s excellent capability for detecting nuanced emo-
tional expressions and physical gestures in its input is
useful for behavioral analysis, human-computer inter-
action studies, and mental health diagnostics (Cheung
et al., 2020).
Figure 12: Test Video Frame 1
Figure 13: Test Video Frame 2
This model presents a technically sound methodology that can enable video content emotion and hand
movement analysis. Advanced computer vision tech-
niques, deep learning models, and detailed visualiza-
tion can provide an accurate and meaningful analy-
sis. The present methodology can make a significant
impact on various fields by opening new avenues in
understanding and interpreting human behavior from
video analysis. The system provides detailed analysis
and visualization of emotions and hand movements,
which make it very useful for gaining valuable in-
sights. This thus serves as a significant contribution to
research and practical applications in behavioral stud-
ies, human-computer interaction, and mental health
diagnostics (Varsha et al., 2021)(Gopalamma et al.,
2024).
Figure 14: Bar Graph of Models on tested Datasets vs their
Accuracies achieved
Figure 15: Test Video Frame 3
Figure 16: Test Video Frame 4
4 EXPERIMENTATION AND
RESULTS
To test the system, we fed in a video of 1 minute and 3 seconds. The video was reprocessed at 1 fps, giving a total of 63 frames. Using MediaPipe's hand landmark model, hand movement detection gave high-precision results by recognizing 21 key points on each hand in each frame. DeepFace was used for facial expression recognition, assigning probabilities to a range of emotions including happiness, sadness, anger, and surprise. The results showed happiness detected with an emotion score of over 2,500, sadness over 200, anger over 50, neutrality over 1,300, fear over 1,200, disgust over 20, and surprise over 700. Hand movements were observed in 2% of the frames, giving evidence of their presence throughout the video. Visualizations were generated using Matplotlib: bar graphs for the total emotion scores and hand movement frequencies, and line graphs for the temporal evolution of these features along the video timeline. This experiment showed that the system is capable of correctly analysing and visualizing emotions and hand movements on a real-world video for behavioral analysis and mental health monitoring (Gopalamma et al., 2024).
Figure 17: Bar Graph of Compiled Emotions and Hand Movements detected in the video for their scores
Figure 18: Line Graph of Emotions and Hand Movements detected in the video for their scores
These results, shown in bar graphs, provided an overview of the cumulative emotion scores and hand movement counts for the entire 1:03-minute video. By summing the detected emotions and counting hand movements for every analysed frame, the totals show how often each emotion and each hand movement occurred. The bar graph represents these totals, showing which emotions were most and least detected and to what degree hand movements occurred. This is an excellent way of comparing the different emotional and physical behaviors captured in this video,
which is very useful for behavioral and psychological
analysis. The importance of this bar graph is in the
clarity of the snapshot that it provides of dominant
emotions and behaviors, which can be very important
in identifying prevalent patterns and informing thera-
peutic interventions.
The results from the line graphs gave a detailed
temporal analysis, showing changes in every emotion
and hand movement through the video timeline. The
view was a graph of the emotion scores and hand
movement occurrences mapped frame by frame, pro-
viding a continuous record of how these behaviors
evolved through the course of the 1:03-minute video.
Peaks and troughs in this graph thus captured the mo-
ments of heightened or reduced emotional expression
and hand activity. The plot made it easier to identify the particular time points at which changes in behavior took place and therefore improved understanding of the temporal dynamics of the emotions and physical movements under study in this video. This line graph is important because of its capacity to trace the progression and variability of emotions and behaviors across time, which is vital for detecting the triggers and the context in which emotional and physical responses take place.
The proposed training model, which combines VGG16, a Transformer block, and an additional attention layer, achieves 94.33% training accuracy and 95.76% validation accuracy when trained on the RAF-DB dataset, and the application model built with DeepFace and MediaPipe achieves better and more robust results in comparison with the SE-ResNet model on RAF-DB, which reaches an accuracy of 83.37% (Huang et al., 2023); DeepFace on the LFW database, which achieves 98.61% accuracy (Srisuk and Ongkittikul, 2017); Transformers combined with YOLOv8, which achieve 81% accuracy in predicting mental illnesses (Berroukham et al., 2023); the VGG16 model, which achieves 92.7% top-5 accuracy on the ImageNet dataset (Kusumawati et al., 2022); and MediaPipe, which achieves an accuracy of 97.67% when its built-in framework is used to analyse physical movements in real-time video (Teja Gontumukkala et al., 2022).
5 CONCLUSION
In conclusion, VGG16 and Transformer models, when combined with DeepFace and MediaPipe, offer a disruptive way to carry out behavioral analysis and mental health diagnostics. We obtained a high accuracy on the RAF-DB dataset using VGG16 feature extraction coupled with Transformer models, which further process the features with their advanced attention capabilities. DeepFace was very good at detecting emotions, while MediaPipe tracked hand movements with very high accuracy, allowing video frames to be studied in detail for a more refined emotional and behavioral understanding. Through these state-of-the-art technologies, the dataset derived from the 1:03-minute video returned nuanced emotion scores and hand movement patterns on a frame-by-frame basis.
Line graphs and bar graphs resulted in a full view of
the emotional prevalence and temporal dynamics of
hand movements, providing critical information to-
ward understanding behavioral trends. It is at this
kind of granular level of detail that therapists are sup-
ported in the identification of specific emotional states
and behavioral change, hence improving diagnosis.
The ability to detect and analyze emotions and hand movements so accurately provides insight into a patient's mental and emotional status that might otherwise be out of reach. This refined technique makes it possible to obtain early warnings of impending mental health disorders while also enabling the development of more individualized
and effective treatment strategies. Ultimately, therefore, the integration of such advanced technologies into mental health diagnostics presents a very promising future, driving changes in therapeutic practices for better treatment results.
REFERENCES
Achlioptas, P., Ovsjanikov, M., Guibas, L., and Tulyakov, S.
(2023). Affection: Learning affective explanations for
real-world visual data. In 2023 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 6641–6651.
Aina, J., Akinniyi, O., Rahman, M. M., Odero-Marah, V.,
and Khalifa, F. (2024). A hybrid learning-architecture
for mental disorder detection using emotion recogni-
tion. IEEE Access, 12:91410–91425.
Awana, A., Singh, S. V., Mishra, A., Bhutani, V., Kumar,
S. R., and Shrivastava, P. (2023). Live emotion detec-
tion using deepface. In 2023 6th International Con-
ference on Contemporary Computing and Informatics
(IC3I), volume 6, pages 581–584.
Bajpai, D. and He, L. (2020). Custom dataset creation
with tensorflow framework and image processing for
google t-rex. In 2020 12th International Conference
on Computational Intelligence and Communication
Networks (CICN), pages 45–48.
Beneytez, C. (2023). Intolerance-of-uncertainty and anxiety
as serial mediators between emotional dysregulation
and repetitive patterns in young people with autism.
Research in Autism Spectrum Disorders, 102:102116.
Berroukham, A., Housni, K., and Lahraichi, M. (2023).
Vision transformers: A review of architecture, ap-
plications, and future directions. In 2023 7th IEEE
Congress on Information Science and Technology
(CiSt), pages 205–210.
Bhanupriya, M., Kirubakaran, N., and Jegadeeshwari, P.
(2023). Emotiontracker: Real-time facial emotion de-
tection with opencv and deepface. In 2023 Interna-
tional Conference on Data Science, Agents & Artifi-
cial Intelligence (ICDSAAI), pages 1–4.
Chang, C.-Y., Chang, C.-W., Zheng, J.-Y., and Chung, P.-C.
(2013). Physiological emotion analysis using support
vector regression. Neurocomputing, 122:79–87. Ad-
vances in cognitive and ubiquitous computing.
Chauhan, S., Mittal, M., Singh, H., Kumar, S., Goel, P.,
and Gupta, S. (2023). Predictive analysis on student’s
mental health towards online mobile games using ma-
chine learning. In 2023 IEEE 15th International Con-
ference on Computational Intelligence and Communi-
cation Networks (CICN), pages 321–324.
Cheung, D. K., Tam, D. K. Y., Tsang, M. H., Zhang, D.
L. W., and Lit, D. S. W. (2020). Depression, anxi-
ety and stress in different subgroups of first-year uni-
versity students from 4-year cohort data. Journal of
Affective Disorders, 274:305–314.
Dzedzickis, A., Kaklauskas, A., and Bucinskas, V. (2020).
Human emotion recognition: Review of sensors and
methods. Sensors, 20(3):592.
Fan, Y. (2023). The improvements for the hands gesture
recognition based on the mediapipe. In 2023 2nd In-
ternational Conference on Data Analytics, Comput-
ing and Artificial Intelligence (ICDACAI), pages 748–
753.
Firmansyah, A., Kusumasari, T. F., and Alam, E. N. (2023).
Comparison of face recognition accuracy of arcface,
facenet and facenet512 models on deepface frame-
work. In 2023 International Conference on Com-
puter Science, Information Technology and Engineer-
ing (ICCoSITE), pages 535–539.
Galea, N. and Seychell, D. (2022). Facial expression recog-
nition in the wild: Dataset configurations. In 2022
IEEE 5th International Conference on Multimedia In-
formation Processing and Retrieval (MIPR), pages
216–219.
Gao, T., Liang, L., Li, M., Su, Y., Mei, S., Zhou, C., and
Meng, X. (2022). Changes in the comorbidity patterns
of negative emotional symptoms and internet addic-
tion over time among the first-year senior high school
students: A one-year longitudinal study. Journal of
Psychiatric Research, 155:137–145.
Gleason, M. M. (2023). Editorial: It’s not just a phase, and
we know what to do: Children with early-onset men-
tal health concerns deserve care now. Journal of the
American Academy of Child & Adolescent Psychiatry.
Gopalamma, A., Patnaik, G. G., Karthik, P., Mohan, P.,
Venkatesh, K., and Joga, S. R. K. (2024). Analysis
of body language and detecting state of mind using
cnn. In 2024 International Conference on Intelligent
Systems for Cybersecurity (ISCS), pages 1–6.
Goueslard, K., Quantin, C., and Jollant, F. (2024). Self-
harm and suicide death in the three years follow-
ing hospitalization for intentional self-harm in adoles-
cents and young adults: A nationwide study. Psychia-
try Research, 334:115807.
Huang, B., Ying, J., Lyu, R., Schaadt, N. S., Klinkhammer,
B. M., Boor, P., Lotz, J., Feuerhake, F., and Merhof,
D. (2024). Utnetpara: A hybrid cnn-transformer archi-
tecture with multi-scale fusion for whole-slide image
segmentation. In 2024 IEEE International Symposium
on Biomedical Imaging (ISBI), pages 1–5.
Huang, Z.-Y., Chiang, C.-C., Chen, J.-H., Chen, Y.-C.,
Chung, H.-L., Cai, Y.-P., and Hsu, H.-C. (2023). A
study on computer vision for facial emotion recogni-
tion. Scientific Reports, 13(1):8425.
Islam, M. M., Hassan, S., Akter, S., Jibon, F. A., and
Sahidullah, M. (2024). A comprehensive review of
predictive analytics models for mental illness using
machine learning algorithms. Healthcare Analytics,
6:100350.
Kusumawati, D., Ilham, A. A., Achmad, A., and Nurtanio,
I. (2022). Vgg-16 and vgg-19 architecture models
in lie detection using image processing. In 2022 6th
International Conference on Information Technology,
Information Systems and Electrical Engineering (ICI-
TISEE), pages 340–345.
Kwaning, K., Ullah, A., Biely, C., Jackson, N., Dosanjh,
K. K., Galvez, A., Arellano, G., and Dudovitz, R.
(2023). Adolescent feelings on covid-19 distance
learning support: Associations with mental health,
social-emotional health, substance use, and delin-
quency. Journal of Adolescent Health, 72(5):682–687.
Ma, J., Ma, L., Ruan, W., Chen, H., and Feng, J. (2022).
A wushu posture recognition system based on medi-
apipe. In 2022 2nd International Conference on Infor-
mation Technology and Contemporary Sports (TCS),
pages 10–13.
Madrid, G. K. R., Villanueva, R. G. R., and Caya, M. V. C.
(2022). Recognition of dynamic filipino sign language
using mediapipe and long short-term memory. In 2022
13th International Conference on Computing Com-
munication and Networking Technologies (ICCCNT),
pages 1–6.
Miao, G. (2021). Application of cnn-based face recognition
technology in smart logistics system. In 2021 20th
International Symposium on Distributed Computing
and Applications for Business Engineering and Sci-
ence (DCABES), pages 100–103.
Mishra, S., Surya, S. N., and Gupta, S. (2023). Emotional
intelligence: An approach to analyze stress using
speech and face recognition. In Computational Intel-
ligence in Analytics and Information Systems, pages
343–360. Apple Academic Press.
Quintar, N. A., Escribano, J. G., and Manrique, G. M.
(2025). How technology augments dance movement
therapy for autism spectrum disorder: A systematic
review for 2017–2022. Entertainment Computing,
52:100861.
Reddy, D. U., Kumar, K. P., Ramakrishna, B., and Sankar,
U. G. (2023). Development of computer vision based
assistive software for accurate analysis of autistic
child stereotypic behavior. In 2023 International Con-
ference on Recent Advances in Electrical, Electron-
ics, Ubiquitous Communication, and Computational
Intelligence (RAEEUCCI), pages 1–5.
Reshma, R. and Jose Anand, A. (2023). Predictive and com-
parative analysis of lenet, alexnet and vgg-16 network
architecture in smart behavior monitoring. In 2023
Seventh International Conference on Image Informa-
tion Processing (ICIIP), pages 450–453.
Singhal, R., Modi, H., Srihari, S., Gandhi, A., Prakash,
C. O., and Eswaran, S. (2023). Body posture correc-
tion and hand gesture detection using federated learn-
ing and mediapipe. In 2023 2nd International Confer-
ence for Innovation in Technology (INOCON), pages
1–6.
Srisuk, S. and Ongkittikul, S. (2017). Robust face recog-
nition based on weighted deepface. In 2017 Inter-
national Electrical Engineering Congress (iEECON),
pages 1–4.
Teja Gontumukkala, S. S., Sai Varun Godavarthi, Y., Ravi
Teja Gonugunta, B. R., and Palaniswamy, S. (2022).
Hand cricket game using cnn and mediapipe. In 2022
13th International Conference on Computing Com-
munication and Networking Technologies (ICCCNT),
pages 1–6.
Varsha, M., Ramya, M., Sobin, C. C., Subheesh, N., and
Ali, J. (2021). Assessing emotional well-being of stu-
dents using machine learning techniques. In 2021 19th
OITS International Conference on Information Tech-
nology (OCIT), pages 336–340.
Weng, S., Zhang, P., Chang, Z., Wang, X., Li, S., and Shi,
B. (2023). Affective image filter: Reflecting emo-
tions from text to images. In 2023 IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), pages
10776–10785.
Yamsani, N., Jabar, M. B., Adnan, M. M., Hussein, A.
H. A., and Chakraborty, S. (2023). Facial emotional
recognition using faster regional convolutional neural
network with vgg16 feature extraction model. In 2023
3rd International Conference on Mobile Networks and
Wireless Communications (ICMNWC), pages 1–6.
Yan, F., Yan, B., and Pei, M. (2023). Dual transformer en-
coder model for medical image classification. In 2023
IEEE International Conference on Image Processing
(ICIP), pages 690–694.
Yuan, M., Lv, N., Xie, Y., Lu, F., and Zhan, K. (2023). Clip-
fg:selecting discriminative image patches by con-
trastive language-image pre-training for fine-grained
image classification. In 2023 IEEE International Con-
ference on Image Processing (ICIP), pages 560–564.
Zhang, M., Liu, L., Lei, Z., Ma, K., Feng, J., Liu, Z.,
and Jiao, L. (2024). Multiscale spatial-channel trans-
former architecture search for remote sensing image
change detection. IEEE Geoscience and Remote Sens-
ing Letters, 21:1–5.
Zhang, Q. (2020). Facial expression recognition in vgg net-
work based on lbp feature extraction. In 2020 5th In-
ternational Conference on Mechanical, Control and
Computer Engineering (ICMCCE), pages 2089–2092.
Şen, S. Y. and Özkurt, N. (2020). Convolutional neural network hyperparameter tuning with Adam optimizer for ECG classification. In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), pages 1–6.
Yunusovich, A. V., Ahmedov, F., Norboyev, K., and Za-
kirov, F. (2022). Analysis of experimental research
results focused on improving student psychological
health. International Journal of Modern Education
and Computer Science, 14:14–30.