Enhanced Face Reconstruction and Recognition System with
Audio-Visual Fusion
Prathika Muthu, Damodharan Asaithambi Ramani and Jenifer Arputham
Artificial Intelligence and Data Science, St.Joseph’s Institute of Technology,Chennai,Tamilnadu, India
Keywords: Audio-Visual Fusion, Face Reconstruction, Face Recognition, Local Binary Pattern (LBP), Radon Transform,
Autoencoder, Convolutional Neural Network (CNN)
Abstract: Deep Learning-Based Audio-Visual Fusion Approach for Enhanced Face Reconstruction and Recognition
System: A New Paradigm for Improving Accuracy of Face Reconstruction and Recognition. The challenging
factors in this area, namely illumination, pose, and expression, have been addressed by Local Binary Pattern
over Radon Transform audio feature extraction that are fused with visual data. The features are encoded with
an autoencoder while the CNN-based decoder reconstructs facial images of high quality from noisy or
incomplete data. This innovative system will improve the accuracy of recognition in any scenario, making it
valuable for forensic analysis, security, and adaptive user interfaces. Audio-visual fusion can be used to
perform holistic facial analysis, which is far beyond the traditional visual-only approach. Advanced neural
networks provide much better performance than existing approaches. Future extensions could include thermal
imaging, depth data, or real-time processing for dynamic environments. This system, based on deep learning
techniques, marks an important step in facial recognition technology with great potential applications across
various domains that require reliable and precise facial identification.
1 INTRODUCTION
Using audio descriptions and visual data to produce
better face reconstruction and identification accuracy,
the “Enhanced Face Reconstruction and Recognition
System Using Deep Learning with Audio-Visual
Fusion” is a paradigm leap in facial recognition
technology. Among the many serious drawbacks of
traditional facial recognition systems is their inability
to process visual data that is unclear, loud, or missing.
Their effectiveness is hampered by these limitations
in situations with different lighting conditions,
postures, and facial expressions—all of which are
crucial for real-world applications like security and
forensic investigation. To address these challenges,
the proposed system integrates audio and visual
inputs for a holistic analysis of facial features.
Essential contextual information is provided by audio
data, which is frequently underused in facial
recognition. Because of its resilience in identifying
directional patterns and textures in sound waves, the
Local Binary Pattern over Radon Transform (LBRP)
is used to extract significant features from audio
descriptions. To improve the portrayal of face
characteristics, these traits are combined with visual
information. A sophisticated deep earning framework
is used in the system architecture. High dimensional
face traits are encoded by an autoencoder technique,
which guarantees effective compression while
maintaining important data. From the encoded data, a
CNN-based decoder that uses transposed convolution
reconstructs high-fidelity face pictures. Transposed
convolution was chosen in particular because it can
efficiently up sample features while preserving
spatial consistency and guaranteeing high-quality
reconstruction even when inputs are noisy or
insufficient. By combining audio-visual data, the
system is in a unique position to perform better than
conventional techniques and adjust to difficult
situations such different lighting, postures, and facial
expressions. Among the contributions of the system
are utilizing audio-visual fusion to overcome the
shortcomings of conventional technologies.
Presenting LBRP, an efficient method for extracting
audio features that enhance visual data .utilizing a
strong architecture that combines CNNs and
autoencoders to achieve accurate reconstruction.
Future improvements can include using bigger and
more varied datasets, enabling real-time processing in
dynamic contexts, and integrating speech patterns
Muthu, P., Asaithambi Ramani, D. and Arputham, J.
Enhanced Face Reconstruction and Recognition System with Audio- Visual Fusion.
DOI: 10.5220/0013734400004664
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 3rd International Conference on Futuristic Technology (INCOFT 2025) - Volume 3, pages 887-894
ISBN: 978-989-758-763-4
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
887
and emotional tones with auditory data. These
developments will improve the system’s functionality
even more and broaden its use in fields that demand
accurate and dependable facial recognition.
2 RELATED WORKS
A diffeomorphic volume-to-slice registration
approach with a deep generative prior to address
motion artifacts in prenatal MRI, achieving robust
volumetric reconstruction. Validated on 72 fetal
datasets (20–36 weeks gestation), it outperformed
state-of-the-art techniques with a mean absolute error
of 0.618 weeks and = 0.958 for gestational age
prediction, with accuracy further enhanced by
combining brain and trunk data. (Grande, et al. ,
2023)Benefits include superior image quality and
comprehensive fetal analysis, while limitations
involve high computational complexity and the need
for broader validation across diverse imaging
conditions. Using min-max concave (MC) penalties
for unbiased sparse constraints and total variation
(TV) for uniform intensity, it suggests a nonconvex
regularization technique for Magnetic Particle
Imaging (MPI). The method improves reconstruction
accuracy by employing an alternate direction method
of multipliers (ADMM) and a two step parameter
selection process. (Zhu, et al. , 2024)
It decreased intensity error from 28 percent to 8
percent when tested on OpenMPI, simulations, and
hand-help scanner data. While there are benefits like
better picture quality and accurate quantitative
characteristics, there are drawbacks including
computational complexity and the requirement for
more extensive real-world validation.By integrating
image priors, kernelized expectation maximization
(KEM) aids in the difficult task of reconstructing low-
count PET data. In order to improve reconstruction,
this work presents implicit regularization using a
deep coefficient prior, which is represented by a
convolutional neural network. To ensure monotonic
likelihood improvement, the suggested neural KEM
method alternates between a deep-learning phase for
updating kernel coefficients and a KEM step for
image updates. It performed better than conventional
KEM and deep image prior techniques, as confirmed
by simulations and patient data. (Gong, Badawi, et al.
, 2023)
Improved reconstruction accuracy and effective
optimization are benefits; nevertheless,
computational complexity and the requirement for
further clinical validation are drawbacks. Positronium
lifetime (PLI), which is impacted by tissue
microenvironments, is captured by Positron Emission
Tomography (PET) imaging, providing information
on the course of illness. A statistical image
reconstruction technique for high-resolution PLI is
presented in this work, which includes a correction
for random triple coincidence occurrences that is
essential for real-world uses. The technique may
provide life time pictures with high accuracy, low
variation, and resolution similar to PET activity
images utilizing the existing time of flight resolution,
as shown by simulations and experimental
investigations. (Guan, et al. , 2024).
Figure 1: Face Recognition
3 METHODOLOGY
3.1 Dataset
The ”Labeled Faces in the Wild” (LFW) dataset is a
widely used benchmark for studying unconstrained
face recognition.
It organizes images into folders labeled by
individual names, with each folder containing
samples of that person. Captured in real-world
conditions, the dataset presents challenges such as
varying lighting, poses, and occlusions. Aligned
facial land marks, including the eyes, nose, and
mouth, ensure uniformity enhancing the performance
of deep learning models. LFW is particularly valuable
for tasks like face verification and person
reidentification as many individuals have multiple
images.it includes diverse facial expressions and
angles, making it ideal for robust model training and
evaluation.
Table 1:Dataset Statistics
S.No Name No.of Images
1 LFW 13233
2 CELEBA 202599
INCOFT 2025 - International Conference on Futuristic Technology
888
3.2 DataCollection:
Images of faces are collected from datasets that
include images of various poses, lighting conditions,
expressions, and even occlusions: CelebA, LFW,
CASIA-WebFace. Audio descriptions include sound
and pitch; timestamps are provided and aligned to
corresponding facial features in the collection so that
every audio feature would correspond to a
corresponding video frame even in cases of dynamic
scenarios. For example, mapping audio descriptors
like pitch and energy to the properties of video frames
adds up to the accuracy. Data validation comes in to
ensure that the data is well-organized and of good
quality, ensuring meaningful results from the
analysis.It ensures proper integration of audio and
visual information.
3.3 Data Preprocessing:Audio-Video:
Resizing: Resizing will ensure all images fed into a
machine learning model have an equalized pixel
resolution. This is quite crucial in providing
consistency on all fronts. Preprocessing through
resizing yields images with identical
dimensionalities, which is helpful for the model.
However, resizing can sometimes alter the aspect
ratio, and this is retained to minimize distortion
further. Some common ones are the nearest
neighbour, bilinear, and bicubic. Resizing
standardizes the input but data loss will also be at a
greater risk, especially if the images get compressed.
Normalization: The homogeneity of normalizing
pixel values within a standard range of 0 to 1 or -1 to
1 improves model performance during pre-
processing. This gives a fast convergence, avoids
instability at any possible point, and provides equal
contribution of all pixels.
Data Augmentation: Rotating images by, for
instance ±15° or ±30° forces the model to detect
objects without regard to their angle. The horizontal
or vertical flip allows the model to handle elements
reflected over one axis. Shifting image along both
axes x and y improves the model’s ability to identify
objects at various positions, thus position-invariant.
Noise Reduction:The process of audio
preprocessing ensures that noise removal takes place,
thus ensuring that there is clear feature extraction.
Amongst some of the techniques which have been
used to that noise removal takes place, thus ensuring
that there clear feature extraction. Amongst some of
the techniques which have been used to that noise
removal takes place, thus ensuring that there clear
feature extraction have been used to reduce unwanted
frequencies are: spectral subtraction, band-pass
filtering, and high/low-pass filtering. wavelet
denoising, which clean the audio signal and post
Denoising procedures include Wiener filtering and
processing smoothing, which helps prevent artifacts.
To make sure that every audio feature matches the
corresponding visual frame, audio and visual inputs
are timestamped and along throughout data
collection. Pitch and energy are examples of audio
descriptors that are translated to the temporal
properties of the video frames in dynamic situations.
Feature Extraction using LBRP: The Local
Binary Radon Pattern technique is a process where
audio features are extracted through local textures
and directional patterns. Similar to this, it compares
short frames of audio signals that capture how the
energy of the sound changes along time and applies
the Radon transform to determine the shift in
directions. Then, it assists in correlating the auditory
cues to visual data ; this performs better
reconstruction of faces from audio descriptions.
Figure 2: Component Diagram
3.4 Decoding of Encoder
Input Layer: The input to the encoder is a high-
dimensional data vector, such as a facial image
represented by pixel values.Let the input data be
denoted as:
x ∈𝑅
(1)
where n is the dimensionality of the input data
(e.g., the number of pixels in an image).
Fully Connected/Convolutional Layers: In an
encoder that has a deep learning approach, the input
passes through several layers, all of which are fully
connected or convolutional. These layers apply
transformations to learn feature representations. Let’s
Enhanced Face Reconstruction and Recognition System with Audio- Visual Fusion
889
take the fully connected layer, where the
transformation is given by:
h = f (W x + b) (2)
where:
h is the hidden layer (compressed feature
representation),
W is the weight matrix of the layer,
b is the bias vector,
f (·) is an activation function such as ReLU
(Rectified Linear Unit).
For a convolutional layer, the transformation
involves convolution operations:

(
)
=𝑓
∑∑
𝑊

()
𝑥
.


+𝑏
()


(3)
where:
𝑊
()
is the convolution kernel (filter) of size
M × N ,
𝑥
,
is the local patch of the input
centered at (i, j),
f (·) is the activation function (e.g., ReLU).
Pooling/Downsampling Layers: Pooling layers
are used to reduce the dimensionality and focus on the
most important features. A common type of pooling
is max pooling, where the transformation is given by:


=max
,
(4)
This operation reduces the spatial dimensions by
taking the maximum value from a patch of the feature
map, which decreases the resolution but preserves
significant features. Bottleneck Layer: By
condensing high-dimensional inputs into a single
latent space, the autoencoder’s bottleneck layer
efficiently captures the combined representation of
audio and visual characteristics. Important aspects of
both senses are combined, maintaining connections
like the way some auditor signals correspond with
visual patterns. Even with noisy or incomplete data,
this latent representation guarantees reliable encoding
of crucial, complementary information, allowing for
precise reconstruction. It can be mathematically
represented as:
𝑧=𝑓(𝑊
ℎ+𝑏
) (5)
where:
z is the low-dimensional embedding or
latent space representation of the input,
𝑊
is the weight matrix of the bottleneck
layer,
𝑏
is the bias vector of the bottleneck
layer,
f (·) is an activation function (e.g.,
ReLU).
Figure 3: System Architecture
3.5 Decoding of CNN
In a Convolutional Neural Network (CNN)- based
decoder architecture, the decoder reconstructs an
image from a com pressed representation (often
produced by an encoder or some fused features).
Input from Encoder (Compressed Features):
The decoder takes the compressed feature map from
the encoder. This compressed data encapsulates
important high-level features of the original image.
Transposed Convolution Layers
(Deconvolution):The core part of a CNN decoder is
the transposed convolution layers. These layers are
used to upsample the compressed feature map to a
higher resolution, typically back to the size of the
original image. The output dimensions of a
transposed convolution layer can be computed using
the following formula:
𝐻

=
(
𝐻

−1
)
×𝑆+𝐾−2𝑃 (6)
𝑊

=
(
𝑊

−1
)
×𝑆+𝐾−2𝑃 (7)
where:
𝐻

and 𝑊

are the height and width
of the output feature map,
𝐻

and 𝑊

are the height and width of
the input feature map,
S is the stride,
K is the kernel size,
P is the padding applied.
The transposed convolution layers gradually
increase the resolution, reconstructing the spatial
structure of the image.
ReLU Activation Function: After each
transposed convolution layer, the ReLU (Rectified
Linear Unit) activation function is typically applied
to introduce non-linearity, helping the decoder learn
complex patterns The function is defined as:
INCOFT 2025 - International Conference on Futuristic Technology
890
f(x)=max(0,x) (8)
where x is the input. This ensures that only
positive values are passed on, effectively handling the
non-linearity of the data.
Final Convolution Layer: The final layer of the
CNN decoder is typically a convolution layer with a
sigmoid activation function, which maps the feature
maps to the correct number of channels (for example,
1 for grayscale images or 3 for RGB images).
𝜎
(
𝑥
)
=


(9)
This function normalizes the output pixel values
between 0
and 1.
Loss Calculation (Reconstruction Error): The
reconstructed image is compared to the original
image using a loss function like Mean Squared Error
(MSE).
𝑀𝑆𝐸 =
(𝑦

𝑦
)

(10)
where:
𝑦
is the true pixel value,
𝑦
is the predicted pixel value.
The MSE measures the difference between the
original and
the reconstructed image.
3.6 Face Recognition
Feature Embedding: In CNN-based face
recognition, after the encoder extracts features from
the input image, the features are mapped into a fixed-
length embedding vector. This embedding represents
the unique characteristics of a face, enabling com
parison across different images.We define the output
of the fully connected (FC) layer as:
e = FC (f (x)) = W · f (x) + b (11)
where f (x) represents the features extracted by the
CNN encoder from the input image x, W is the weight
matrix, and b is the bias vector.
Similarity Measurement: To determine whether
two faces are similar (or belong to the same person),
we compute the similarity between their embedding
vectors. Two commonly used similarity metrics are:
a) Cosine Similarity: The cosine similarity
between two embedding vectors e1 and e2 is given
by:
𝑆

(
𝑒
,𝑒
)
=
.
‖‖
(12)
where e1 and e2 are two embedding vectors and
e represents the magnitude (L2 norm) of vector e.
b) Euclidean Distance: The Euclidean distance
between two embedding vectors e1 and e2 is given
by:
𝑑

(
𝑒
,𝑒
)
=
𝑒
−𝑒
=
(
𝑒

−𝑒

)

(13)
where e1i and e2i are the components of the
embedding vectors e1 and e2, respectively.
The smaller the Euclidean distance, or the closer
the cosine similarity is to 1, the more similar the two
embeddings, and thus, the more likely they represent
the same individual.
3.7 Classifiation
Once the similarity score (cosine similarity or
Euclidean distance) is obtained, the next step is to
classify the identity of the individual.
Softmax Function: When you have multiple
classes (identities), you can use a softmax activation
to convert similarity scores into probabilities. The
identity with the highest probability is selected as the
predicted class. The softmax function is defined as:
𝑃
=
(14)
where zi is the similarity score for class i, and
𝑒
is the sum of the exponentials of similarity
scores over all classes.
The identity corresponding to the highest 𝑃
is
chosen as the predicted class.
Sigmoid Function (for Binary Classification):
If the goal is to classify whether the face matches a
specific identity (binary classification), the sigmoid
activation function can be used:
𝑃
=


(15)
where zi is the similarity score. The output will be
a value between 0 and 1, representing the probability
that the face matches the given identity. A value
closer to 1 indicates a match, while a value closer to
0 indicates no match.
Enhanced Face Reconstruction and Recognition System with Audio- Visual Fusion
891
3.8 Training process
To teach the Autoencoder and CNN models, we feed
them data and use specific metrics to see how well
they’re learning.
For the Autoencoder, we measure how closely the
output matches the input using a ”Mean Squared
Error” measure. For the CNN, which focuses on
recognizing patterns, we use a Cross-Entropy”
measure to assess how well it’s making predictions.
3.9 Fine Tuning Process
We experiment with different settings, such as how
fast the model learns (learning rate), how many data
points we process at a time (batch size), and the
structure of the model itself. This tweaking helps us
improve the model’s performance.
4 PERFORMANCE METRICS
Accordingly, various performance indicators are used
to evaluate the effectiveness of the suggested deep
learning system for malignant cell detection.
Accuracy: Accuracy represents how frequently
the model correctly classifies instances as cancerous
or not. It is calculated based on true positives (TP),
true negatives (TN),
false positives (FP), and false negatives (FN).
The formula for accuracy is given by:
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =


(16)
Precision: Precision indicates how many of the
instances that the model revealed as positive, or
cancerous, are actually correct. It measures the
accuracy of the model in predicting positive cases.
The formula for precision is:
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =


(17)
Recall(Sensitivity): Recall, also known as
sensitivity, quantifies how well the model identifies
actual positive cases. It displays the ratio of true
positives to the total number of actual positives
(TP+FN):
𝑅𝑒𝑐𝑎𝑙𝑙 =


(18)
F-1 score: The F1-Score is the harmonic mean of
recall and precision. It is particularly useful in cases
where there is a class imbalance. The formula for F1-
Score is:
𝐹1 − 𝑆𝑐𝑜𝑟𝑒 = 2 ×
×
×
(19)
ROC-AUC: The Receiver Operating
Characteristic curve (ROC) plots the true positive rate
(recall) against the false positive rate at different
threshold values. The Area Under the Curve (AUC)
is a summary measure of how well the model
distinguishes between classes. A higher AUC value
indicates better performance.
5 RESULT
Integration of both auditory and visual data enables
deep learning techniques to advance methods of face
reconstruction along with detection. The system will
require specific hardware that involves features of a
GPU that has CUDA support,like the NVIDIA RTX
series, 16GB RAM to process data, high-speed SSD
for holding big datasets, and a multicore CPU such as
Intel i7 or AMD Ryzen series to carry out
preprocessing and inference tasks.
It consists of three major datasets: Labeled Faces
in the Wild (LFW), with 13,233 images,and CelebA
with 202,599 images, and CASIA-WebFace, all of
which are used as training data sets to achieve
diversity and robustness in the model.
It should have Python 3.x as its primary
programming language, along with the installation of
TensorFlow or PyTorch to create a deep learning
model and train it; OpenCV to pre-process the image;
Librosa to extract audio features; and NumPy, Pandas
to handle the data.It uses Local Binary Pattern over
Radon Transform or LBRP for extracting audio
features and combines this with a visual. The system,
besides that, overcomes problems due to pose
variations, variability in lighting, and inadequate or
noisy input data as well.It applies an autoencoder for
the efficient encoding of features and a CNN-based
decoder for reconstructing images with good
quality.The system is compatible with Linux (for
example, Ubuntu 20.04) or Windows 10/11.
Tools such as Jupyter Notebook or Google Colab
are used for development and experimentation.
Version control is ensured using Git, and
environment replication is made easier using
Docker.This leads to a significant improvement in the
accuracies of facial reconstruction and recognition
estimated to be between 90%and 95%, effectively
INCOFT 2025 - International Conference on Futuristic Technology
892
making it suitable for deployment in forensic
analysis, security, and real-time adaptive interfaces.
Future development of the system may include
enhancements in real-time processing of images,
increased datasets, and the inclusion of advanced
auditory cues like speech patterns, emotional tones,
etc, to enhance the accuracy as well as generalization.
Figure 4 : Activity Diagraam
6 CONCLUSION
The proposed audio-visual fusion system greatly
enhance face reconstrction and recognition as it is
capable of mitigating the limitations of traditional
approaches: noisiness, incompleteness, or
inconsistency in data. However, there are several
limitations to its current applications such as a
dependence on high computational resources,
possible bias because of a lack of diversity of data
sets, and non-real-time applicability in processing.
Future research should be directed toward integrating
larger and diverse datasets, incorporating higher level
auditory cues such as tones of emotion and speech,
and optimization of architecture with respect to real-
time systems. Another direction can also be multi-
modal data fusion and edge computing; however,
emerging technologies should find their ways to
further optimize systems in terms of efficiency,
accuracy, and adaptability across real-world
scenarios.
REFERENCES
Cordero-Grande, L., et al. (2023). Fetal MRI by robust deep
generative prior reconstruction and diffeomorphic
registration. IEEE Transactions on Medical Imaging,
42(3), 810-822.
Zhu, T., et al. (2024). Accurate concentration recovery for
quantitative magnetic particle imaging reconstruction
via nonconvex regularization. IEEE Transactions on
Medical Imaging, 43(8), 2949-2959.
Li, S., Gong, K., Badawi, R. D., Kim, E. J., Qi, J., & Wang,
G. (2023). Neural KEM: A kernel method with deep
coefficient prior for PET image reconstruction. IEEE
Transactions on Medical Imaging, 42(3), 785-796.
Guan, Y., et al. (2024). Learning-assisted fast
determination of regularization parameter in
constrained image reconstruction. IEEE Transactions
on Biomedical Engineering, 71(7), 2253-2264.
Huang, B., et al. (2024). SPLIT: Statistical positronium
lifetime image reconstruction via time-thresholding.
IEEE Transactions on Medical Imaging, 43(6), 2148-
2158.
Fan, H., et al. (2024). High accurate and efficient 3D
network for image reconstruction of diffractive-based
computational spectral imaging. IEEE Access, 12,
120720-120728.
Salomon, A., Goedicke, A., Schweizer, B., Aach, T., &
Schulz, V. (2011). Simultaneous reconstruction of
activity and attenuation for PET/MR. IEEE
Transactions on Medical Imaging, 30(3), 804-813.
Zhou, S., Deng, X., Li, C., Liu, Y., & Jiang, H. (2023).
Recognition-oriented image compressive sensing with
deep learning. IEEE Transactions on Multimedia, 25,
2022-2032.
Mohana, M., & Subashini, P. (2023). Emotion recognition
using deep stacked autoencoder with softmax classifier.
2023 Third International Conference on Artificial
Intelligence and Smart Energy (ICAIS), 864-872
Abdolahnejad, M., & Liu, P. X. (2022). A deep autoencoder
with novel adaptive resolution reconstruction loss for
disentanglement of concepts in face images. IEEE
Transactions on Instrumentation and Measurement, 71,
1-13
Bragin, A. K., & Ivanov, S. A. (2021). Reconstruction of
the face image from speech recording: A neural
networks approach. 2021 International Conference on
Quality Management, Transport and Information
Security, Information Technologies (ITQMIS), 491-
494.
Gao, Y., Gao, L., & Li, X. (2021). A generative adversarial
network based deep learning method for low-quality
Enhanced Face Reconstruction and Recognition System with Audio- Visual Fusion
893
defect image reconstruction and recognition. IEEE
Transactions on Industrial Informatics, 17(5), 3231-
3240.
Zhu, Y., Cao, J., Liu, B., Chen, T., Xie, R., & Song, L.
(2024). Identity-consistent video de-identification via
diffusion autoencoders. 2024 IEEE International
Symposium on Broadband Multimedia Systems and
Broadcasting (BMSB), 1-6
Zheng, T., et al. (2024). MFAE: Masked frequency
autoencoders for domain generalization face anti-
spoofing. IEEE Transactions on Information Forensics
and Security, 19, 4058-4069.
Damer, N., Fang, M., Siebke, P., Kolf, J. N., Huber, M., &
Boutros, F. (2023). MorDIFF: Recognition
vulnerability and attack detectability of face morphing
attacks created by diffusion autoencoders. 2023 11th
International Workshop on Biometrics and Forensics
(IWBF), 1-6.
Afzal, H. M. R., Luo, S., Afzal, M. K., Chaudhary, G.,
Khari, M., & Kumar, S. A. P. (2020). 3D face
reconstruction from single 2D image using distinctive
features. IEEE Access, 8, 180681-180689.
Tu, X., et al. (2021). 3D face reconstruction from a single
image assisted by 2D face images in the wild. IEEE
Transactions on Multimedia, 23, 1160-1172.
Chen, Y., Wu, F., Wang, Z., Song, Y., Ling, Y., & Bao, L.
(2020). Self-supervised learning of detailed 3D face
reconstruction. IEEE Transactions on Image
Processing, 29, 8696-8705.
Sun, N., Tao, J., Liu, J., Sun, H., & Han, G. (2023). 3-D
facial feature reconstruction and learning network for
facial expression recognition in the wild. IEEE
Transactions on Cognitive and Developmental Systems,
15(1), 298-309.
Ozkan, S., Ozay, M., & Robinson, T. (2024). Texture and
normal map estimation for 3D face reconstruction.
ICASSP 2024 - 2024 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
3380-3384.
Tu, X., et al. (2022). Joint face image restoration and
frontalization for recognition. IEEE Transactions on
Circuits and Systems for Video Technology, 32(3),
1285-1298.
Lu, T., Wang, Y., Zhang, Y., Jiang, J., Wang, Z., & Xiong,
Z. (2024). Rethinking prior-guided face super-
resolution: A new paradigm with facial component
prior. IEEE Transactions on Neural Networks and
Learning Systems, 35(3), 3938-3952.
Wang, Y., Lu, T., Zhang, Y., Wang, Z., Jiang, J., & Xiong,
Z. (2023). FaceFormer: Aggregating global and local
representation for face hallucination. IEEE
Transactions on Circuits and Systems for Video
Technology, 33(6), 2533-2545.
Wang, X., Guo, Y., Yang, Z., & Zhang, J. (2022). Prior-
guided multi-view 3D head reconstruction. IEEE
Transactions on Multimedia, 24, 4028-4040.
Wang, Z., Huang, B., Wang, G., Yi, P., & Jiang, K. (2023).
Masked face recognition dataset and application. IEEE
Transactions on Biometrics, Behavior, and Identity
Science, 5(2), 298-304.
George, A., Ecabert, C., Shahreza, H. O., Kotwal, K., &
Marcel, S. (2024). EdgeFace: Efficient face recognition
model for edge devices. IEEE Transactions on
Biometrics, Behavior, and Identity Science, 6(2), 158-
168.
Alansari, M., Hay, O. A., Javed, S., Shoufan, A., Zweiri,
Y., & Werghi, N. (2023). GhostFaceNets: Lightweight
face recognition model from cheap operations. IEEE
Access, 11, 35429-35446.
Jabberi, M., Wali, A., Neji, B., Beyrouthy, T., & Alimi, A.
M. (2023). Face ShapeNets for 3D face recognition.
IEEE Access, 11, 46240-46256.
Zhu, Y., et al. (2024). Quantum face recognition with
multigate quantum convolutional neural network. IEEE
Transactions on Artificial Intelligence, 5(12), 6330-
6341.
Yang, Y., Hu, W., & Hu, H. (2023). Neutral face learning
and progressive fusion synthesis network for NIR-VIS
face recognition. IEEE Transactions on Circuits and
Systems for Video Technology, 33(10), 5750-5763.
INCOFT 2025 - International Conference on Futuristic Technology
894