Comparative Study of MTCNN and YuNet for Deepfake Detection
M. Tanmay Adithya, C. Tanush, N. Kathyaini, D. Mohit Reddy and G. Mary Swarna Latha
Department of Computer Science and Engineering, Institute of Aeronautical Engineering, Dundigal, Telangana, India
Keywords: Deepfake Detection, MTCNN, YuNet, Deep Learning, InceptionResNetV1, Face Detection.
Abstract: In recent years, deepfakes have become a prominent digital threat, raising concerns about the potential harm
they can inflict on personal privacy and the fabric of society as trust in visual evidence becomes increasingly
compromised. This paper provides an in-depth comparative analysis of the MTCNN and YuNet face detection
algorithms, specifically focused on deepfake detection use cases. This work compares baseline face
detection models, each systematically integrated with an InceptionResNetV1 classifier, to analyze
which preprocessing technique yields the optimal performance for detecting facial manipulation techniques.
To demonstrate the efficacy of this method, thorough experimental evaluations were conducted on the newly
proposed OpenForensics dataset, which is characterized by diverse cases and rich face-level annotations,
leveraging multiple faces in a single image. The experimental results consistently show that the
YuNet-based pipeline gives a significant improvement over the
MTCNN-based system across all core performance metrics (accuracy 57.2% vs 52.2%; precision 55.0%
vs 51.6%; recall 82.8% vs 81.1%; and F1-score 66.1% vs 63.1%). Moreover, YuNet processes images much
faster, at 0.008 seconds per image on average, compared to MTCNN's 0.024 seconds per image, indicating a
3x computational efficiency improvement. The YuNet pipeline also obtains a higher Area Under the
ROC Curve (AUC) score (0.624 vs 0.544), which measures the ability to correctly classify authentic and
manipulated facial imagery across various classification thresholds. When analyzed in more depth through
confusion matrices, YuNet shows fewer false negatives as well, proving to be more effective at identifying
deepfake images correctly. These findings collectively suggest that YuNet's enhanced detection capabilities,
coupled with its architecture optimized for low-latency processing, make it significantly more suitable for
real-time deepfake detection applications.
1 INTRODUCTION
Face detection is an essential step in many deepfake
detection pipelines and is often used as a
preprocessing step (G. Gupta, et al. 2023).
Convolutional Neural Networks, a type of deep
learning model, have driven progress in face
detection and deepfake detection. These models have
been demonstrated to be proficient in capturing
complicated features from images and videos to
identify subtle patterns of manipulation (M. L.
Saini, et al. 2024). Among face detectors, two widely
used models, MTCNN and YuNet, have gained
popularity for their speed and accuracy, which make
them suitable for incorporation into deepfake
detection pipelines. Advanced generation techniques
make manipulation artifacts increasingly difficult to
recognize, so detection methods must keep improving
to remain effective against emerging threats.
Deepfake has become a
buzzword to describe the ongoing advances in this
rapidly evolving field that can be threatening if not
properly controlled. Developing reliable deepfake
detection systems is crucial for maintaining trust in
digital media. Deep learning models have become an
essential tool in the detection of deepfakes because
they're capable of detecting hidden patterns and
features inside facial images (V. S. Barpha, et al. 2024).
Certain deepfake detection pipelines perform face
detection as an initial preprocessing step, followed by
face localization and alignment: finding and
cropping facial areas from an image restricts
further analysis to the regions most likely to contain
manipulation artifacts (L. Chadha, et al. 2023). With a
face isolated from the background, the detection
process may be more accurate and less
computationally complex since irrelevant background
information is no longer present. Moreover, face
detection enables methods that study distinctive
facial characteristics, expressions, and movements
(G. Gupta, et al. 2023), all of which are affected by
deepfake manipulation. For example, inconsistent
blinking patterns or unnatural lip movements are
easier to detect once the correct facial area has been
isolated (T. T. Nguyen, et al. 2019). Thus, improving the
performance of deepfake detection approaches
strongly relies on the face detection task.
This study approaches deepfake detection through a
comparison of two widely used algorithms: MTCNN
and YuNet. MTCNN, one of the most commonly
employed face detection algorithms (L. Chadha, et al.
2023), is based on multi-task cascaded convolutional
neural networks and is valued for both its accuracy
and speed. It is able to precisely
identify parts of images containing faces. MTCNN is
the mechanism of choice in many computer vision
tasks due to its great robustness in varying conditions,
including pose, lighting, and occlusion. YuNet is a
newly proposed face detection algorithm that
reportedly outperforms state-of-the-art alternatives,
achieving high accuracy and speed even on mobile
devices. YuNet is designed and fine-tuned to be very
lightweight and highly efficient, making it a strong
candidate for real-time applications and for
deployment on computationally constrained devices
such as mobile phones and embedded systems (W. Wu,
H. Peng, et al. 2023). By comparing these two algorithms,
the study intends to provide an informative
perspective on their applicability in the context of
deepfake detection applications in terms of benefits
and drawbacks. The study will evaluate their
performances on detection accuracy, processing
speed, and robustness to changes in the image
conditions like lighting, pose, and occlusion. The
project will also investigate how the choice of face
detection algorithm influences the performance of
InceptionResNetV1, as the latter is used for
classification.
The remainder of this paper is organized as follows:
Section 2 reviews related work, Section 3 presents
the methodology, Section 4 reports the results,
Section 5 provides the discussion, and Section 6
concludes the paper.
2 RELATED WORKS
Deepfakes are synthetic media created with the aid of
sophisticated machine learning techniques (most
commonly, Generative Adversarial Networks), and
their rapidly growing prevalence threatens the
veracity of digital information (G. Gupta, et al. 2023).
Because these manipulated videos and images are
highly realistic, it can be challenging for people to
distinguish them from genuine media (G. Gupta, et al.
2023). This has motivated a large body of research on
efficient deepfake detection. Early deepfake detection
techniques focused on the visual artifacts left during
the forgery process (S. Lyu, 2020).
Such artifacts include irregular blinking patterns
(S. Lyu, 2020), unnatural head movements, and
mismatches between lip movements and uttered
words. Although these methods were effective at
first, techniques for generating deepfakes have since
progressed such that these telltale signs are becoming
more subtle and harder to detect (X. Cao and N. Z.
Gong, 2021). Moreover, the increasing realism of
deepfakes warrants the development of additional
detection techniques. (Xinyooo, 2020) describes
several approaches to deepfake detection, such as
those focused on image quality analysis.
Deep learning has transformed computer vision;
face detection (Tran The Vinh, et al. 2023) and
deepfake detection (M. L. Saini, et al. 2024) are just
two examples. Convolutional neural network models
have shown impressive feature extraction capabilities
on images and videos, along with the ability to
recognize complex manipulation patterns (M. L.
Saini, et al. 2024). Deepfake detection tasks
have shown promising results with architectures like
XceptionNet and EfficientNet (T. Kularkar, et.al
2024).
In addition, (K. Sudarshana, 2021) employed
Recurrent Neural Networks to analyze temporal
inconsistencies in subsets of the video sequence as
another possible approach for deepfake detection.
The performance of InceptionResNetV1 for deepfake
detection has been widely investigated and verified
through existing datasets. By learning very
informative and distinctive face features, it has shown
high detection accuracy, and several studies report
accuracy of around 95% for the classification of real
and synthetic faces. Furthermore, the network's
architecture provides the necessary capacity without
excessive depth, letting it sustain its processing speed
for practical real-time
deepfake detection applications (V. L. L. Thing, 2023).
Face detection is a crucial stage in many deepfake
detection pipelines and is regularly performed as a
preprocessing step (G. Gupta, et al. 2023); (V. S.
Barpha, et al. 2024). Convolutional Neural Networks
have driven advances in both face detection and
deepfake detection, and these models have been
shown to effectively learn complex features from
images and videos in order to recognize subtle
signatures of tampering (M. L. Saini, et al. 2024).
Among the available face detectors, MTCNN and
YuNet have become common choices because of
their speed and detection accuracy, making them
well suited for inclusion in deepfake detection
pipelines.
With deepfake technology constantly changing,
there is an ongoing need for research and
development of new detection solutions to address
these evolving challenges. (K. Sudarshana, et al.
2021) explores recent deepfake detection trends,
including the need for more robust and generalized
detection methods. (S. Dhesi, et al. 2023) calls for
countermeasures against adversarial attacks and more
sophisticated deepfake generation techniques. (M.
Taeb and H. Chi, 2022) provides a detailed account
of various deepfake detection methods, including
artifact-based, biological signal-based, and
behavioral-based approaches. This study also
discusses relevant datasets used for training and
evaluation, such as UADFV and DFTIMIT.
(S. A. Khan and D. Dang‐Nguyen, 2023) presents
a comparative analysis of various deepfake detection
methods, including early CNN-based approaches like
Meso-4 and MesoInception-4, highlighting the
ongoing evolution of detection techniques in response
to increasingly sophisticated deepfake generation
methods. Another study (V. S. Barpha, et al. 2024)
focuses on leveraging MTCNN for feature extraction
in deepfake detection pipelines while also discussing
broader challenges in deepfake detection and the
evolution of generation techniques. These works
collectively emphasize the need for continuous
research and development of robust deepfake
detection methods to keep pace with advancements in
deepfake generation.
3 METHODOLOGY
3.1 Dataset Selection
For the training and evaluation of the system, the
OpenForensics dataset (T. N. Le, et al. 2021) was
chosen, as it offers several benefits for research on
deepfake detection. It was released for multi-face
forgery detection and segmentation and carries rich
annotations such as forgery type (real/fake),
bounding boxes, segmentation masks, forgery
boundaries, and facial landmarks for each face (T. N.
Le, et al. 2021). OpenForensics covers a wide range
of scenarios, making it more diverse than the
majority of baseline datasets, which typically consist
of short videos with near-duplicate frames; this
diversity leads to better generalization (T. N. Le,
et al. 2021). Its size, in both number of images and
variety of scenes, is appropriate for training deep
networks (T. N. Le, et al. 2021). The dataset also
contains multiple faces per image, a property often
missing from other datasets, with faces of varying
sizes and resolutions (T. N. Le, et al. 2021).
In addition, because OpenForensics includes diverse
scenes, including a variety of outdoor settings, it
contributes to the robustness of trained models
(M. Taeb and H. Chi, 2022). Furthermore, its
fine-grained face-wise annotations and varied
scenarios encourage the development of
state-of-the-art deepfake detection and segmentation
capabilities. Table 1 presents the dataset distribution
across the training, validation, and testing splits,
showing that each split contains approximately equal
numbers of real and fake images.
Table 1: OpenForensics Dataset (Source: T. N. Le, et al. 2021).

Dataset Split | Real Images | Fake Images | Total
Training      | 70,001      | 70,001      | 140,002
Validation    | 19,787      | 19,641      | 39,428
Testing       | 5,413       | 5,492       | 10,905
Figure 1: Sample Images from the Dataset, Including Real
(Left) and Fake (Right) Images Used for Training and
Evaluation.
3.2 Face Detection Models
The MTCNN face detection algorithm (J. Du, 2020;
Y. Chai, 2021; V. S. Barpha, et al. 2024; G. Gupta,
et al. 2023) has been widely used owing to its
multi-task learning design. It executes face detection
and facial landmark localization at the same time
using a three-stage cascaded framework:
1. Proposal Network: Rapidly generates
candidate face regions.
2. Refinement Network: Filters the candidate
regions, refining the bounding boxes.
3. Output Network: Further refines the
detections and outputs facial landmarks.
Because MTCNN is trained with multi-task learning
on a cascaded architecture, it can efficiently and
accurately estimate face locations in images with
complex conditions, including varying pose, lighting,
and occlusion (E. Wahab, et al. 2025). It has already
been shown to be a powerful preprocessing step for
deepfake detection pipelines (L. Chadha, et al. 2023).
In one study, MTCNN was used to extract features
for each face, and the results improved the model's
accuracy for deepfake detection (L. Chadha, et al.
2023). MTCNN has also been shown to be a strong
face detection algorithm when occlusion is present
(E. Wahab, et al. 2025). MTCNN is implemented in
common deep learning libraries such as TensorFlow
and PyTorch, and the availability of numerous
pre-trained models enables fast deployment in
deepfake detection pipelines. Face detection has been
recognized as a preprocessing step that adds
significant value to subsequent classification tasks,
and detection algorithms such as MTCNN have
proved quite efficient in extracting facial features
from the provided data (G. Gupta, et al. 2023).
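For illustration, the following is a minimal sketch of the MTCNN detection step as it could be implemented with the facenet_pytorch package, one of the pre-trained implementations of the kind mentioned above; the image path and device are illustrative placeholders:

```python
# Minimal MTCNN face-extraction sketch (assumes the facenet_pytorch package).
from facenet_pytorch import MTCNN
from PIL import Image

# keep_all=True returns every detected face, matching the multi-face
# setting of the OpenForensics dataset.
detector = MTCNN(keep_all=True, device="cpu")

img = Image.open("sample.jpg").convert("RGB")
boxes, probs = detector.detect(img)  # bounding boxes and confidence scores

if boxes is not None:
    for (x1, y1, x2, y2), p in zip(boxes, probs):
        print(f"face at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}), confidence {p:.3f}")
```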
YuNet (W. Wu, H. Peng, et al. 2023), a more recent
face detector, prioritizes speed and efficiency,
particularly on resource-constrained devices. It stands
out as a "tiny" face detector, achieving an 81.1%
Average Precision on the WIDER FACE validation
hard set (W. Wu, H. Peng, et al. 2023). As a lightweight
model, YuNet demonstrates its effectiveness on the
WIDER FACE dataset, scoring 0.834, 0.824, and 0.708
on the validation set. Optimized for fast detection
with a low computational footprint, YuNet can be
used for real-time applications on low-end devices
and operates on images with face sizes ranging from
10x10 to 300x300 pixels (W. Wu, H. Peng, et al. 2023).
However, little is publicly documented about its
internal structure (Tran The Vinh, et al. 2023). Note
that the ONNX model uses a fixed input shape, while
the OpenCV DNN interface can adapt to the exact
image shape dynamically (W. Wu, H. Peng, et al. 2023).
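A comparable sketch for YuNet, assuming OpenCV's FaceDetectorYN interface (available from OpenCV 4.5.4 onwards) and a locally downloaded ONNX weights file; the file path is illustrative:

```python
# Minimal YuNet sketch via OpenCV's FaceDetectorYN.
import cv2

detector = cv2.FaceDetectorYN.create(
    "face_detection_yunet_2023mar.onnx",  # ONNX weights (fixed input shape)
    "",                                    # no separate config file
    (320, 320),                            # initial input size
)

img = cv2.imread("sample.jpg")
h, w = img.shape[:2]
detector.setInputSize((w, h))  # adapt the detector to the actual image shape

_, faces = detector.detect(img)  # each row: x, y, w, h, 10 landmark coords, score
if faces is not None:
    for row in faces:
        x, y, fw, fh = row[:4]
        score = row[-1]
        print(f"face at ({x:.0f},{y:.0f}) size {fw:.0f}x{fh:.0f}, score {score:.3f}")
```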
The cropped facial areas are subsequently passed
as input to a classifier (InceptionResNetV1 in the
present study) trained to distinguish real faces from
fake ones. The outputs of the two pipelines are
compared and analyzed to assess the performance of
MTCNN and YuNet in terms of detection accuracy,
processing speed, and general robustness. Both
pipelines are built in Python with standard deep
learning frameworks, and pre-trained models for
MTCNN, YuNet, and InceptionResNetV1 are used to
speed up development.
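Schematically, the shared pipeline can be expressed as follows; `detect`, `crop_and_resize`, and `model` are illustrative placeholders for the face detector (MTCNN or YuNet), a cropping helper, and the InceptionResNetV1 classifier described in Section 3.3:

```python
# Schematic glue for one image: detect faces, crop them, and score each
# crop as real or fake. All callables are illustrative placeholders.
import torch

def classify_faces(img, detect, crop_and_resize, model):
    results = []
    for box in detect(img):                         # face bounding boxes
        face = crop_and_resize(img, box, size=160)  # (3, 160, 160) tensor
        with torch.no_grad():
            logit = model(face.unsqueeze(0))        # (1, 1) real/fake logit
        results.append((box, torch.sigmoid(logit).item()))
    return results                                  # [(box, fake_probability), ...]
```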
3.3 Feature Extraction and
Classification Using
InceptionResNetV1
The detected faces are passed to the image
classification step, which uses InceptionResNetV1 as
a feature extractor for classification.
InceptionResNetV1 combines Inception modules and
ResNet-style connections to learn the complex facial
features and patterns that help differentiate real faces
from fake ones. The architecture was selected for its
demonstrated effectiveness in image classification
tasks (C. Szegedy, 2017); its ability to learn complex
features makes it a good candidate for deepfake
detection, where image modifications are often
subtle. While other architectures such as XceptionNet
have demonstrated promising results,
InceptionResNetV1 was selected for this study based
on initial performance testing and computational
resource restrictions. Once trained on a large dataset
of real and fake (manipulated) face
images, InceptionResNetV1 can output
dimensionality-reduced embeddings of an input face
that capture detailed information on the fine
differences between real and fake imagery. These
embeddings encode the key facial features present in
an image, which play a central role in identifying
whether that image is real or fake (L. Chadha,
et al. 2023).
To achieve this, InceptionResNetV1 adopts Inception
modules with residual connections attached to them
(L. Chadha, et al. 2023), a variant of the Inception
architecture. Residual (skip) connections let the
network train faster and achieve better performance
(G. Gupta, et al. 2023). The InceptionResNetV1
features are then used to train a binary classifier that
labels faces as real or fake, and the classifier is
fine-tuned on the deepfake dataset to improve
detection performance.
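A minimal sketch of this classification stage, assuming the facenet_pytorch implementation of InceptionResNetV1 with VGGFace2 weights and an illustrative binary head; the hyperparameters are placeholders, not the exact training configuration used in this study:

```python
# Sketch of the classification stage: InceptionResNetV1 as a feature
# extractor with a binary head (assumes facenet_pytorch and PyTorch).
import torch
import torch.nn as nn
from facenet_pytorch import InceptionResnetV1

class DeepfakeClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained 512-d face embeddings from VGGFace2 weights.
        self.backbone = InceptionResnetV1(pretrained="vggface2")
        self.head = nn.Linear(512, 1)  # single logit: real vs. fake

    def forward(self, x):          # x: (N, 3, 160, 160) cropped faces
        emb = self.backbone(x)     # (N, 512) embedding
        return self.head(emb)      # (N, 1) logit

model = DeepfakeClassifier()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on a dummy batch.
faces = torch.randn(8, 3, 160, 160)           # stand-in for cropped faces
labels = torch.randint(0, 2, (8, 1)).float()  # 0 = real, 1 = fake
loss = criterion(model(faces), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```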
3.4 Evaluation Metrics and
Performance Assessment
To comprehensively assess the performance of each
face detection model in the deepfake detection
pipeline, an extensive evaluation framework is
implemented. This framework includes several
complementary metrics that together give a full
picture of the effectiveness of a model on different
points of performance.
The following main metrics are used to evaluate
the performance: accuracy, precision, recall,
F1-score, Area Under the Receiver Operating
Characteristic curve (AUC-ROC), and an estimate of
computational cost. Accuracy is the overall
ratio of correctly recognized samples (both real and
fake) to the total number of samples passed through
the model. Although this metric gives a rough idea of
model performance, it can be misleading in the case
of class imbalance, thus demanding the inclusion of
more metrics.
Precision measures the model's ability to avoid
false positives and is calculated as the number of
correctly identified deepfakes over the total number
of samples that were classified as deepfakes. This
measure is especially important for applications
where making false accusations of tampering could
be highly damaging. On the other hand, recall (or
sensitivity) measures the ability of the model to
identify actual deepfakes, computed as the fraction
of true deepfakes detected out of all deepfakes
present in the dataset. A high recall
value suggests few false negatives, which is vital in
security-critical applications in which a false negative
(i.e., failing to detect a deepfake at all) could cause
severe damage.
To balance the trade-off between precision and
recall, the F1-score, the harmonic mean of precision
and recall, is computed, providing a single metric that
accounts for both false positives and false negatives.
This balanced measure is especially valuable when
the costs of false positives and false negatives are
comparable. The area under the receiver operating
characteristic curve (AUC-ROC) is examined,
representing the discriminative power of the model
over multiple classification cutoffs. The receiver
operating characteristic curve (ROC curve) is a
graphical plot that illustrates the diagnostic ability of
a binary classifier system by plotting its true positive
rate (Recall) against the false positive rate (Fall-out)
at various threshold settings, and the area under the
ROC curve (AUC) is a value between 0 and 1. An
ideal model would give a perfect AUC of 1, whereas
a random classifier would have an AUC of around
0.5. This measure summarizes the model's
performance over the entire operating range at all
possible thresholds.
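For concreteness, all of these metrics can be computed with scikit-learn; the labels and scores below are illustrative stand-ins, not results from this study:

```python
# Computing the evaluation metrics with scikit-learn; labels and scores
# are illustrative placeholders (1 = fake, 0 = real).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_score = [0.2, 0.9, 0.4, 0.6, 0.8, 0.1, 0.7, 0.3]  # model fake-probabilities
y_pred = [int(s >= 0.5) for s in y_score]           # default 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # threshold-independent
```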
Beyond classification performance,
computational efficiency is measured in terms of
processing speed. These encompass the duration of
facial identification, feature extraction, and later
classification of deepfakes. All experiments are run
on the same hardware configurations for fair
comparisons, also providing the average processing
time per image and throughput (images processed per
second).
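A simple per-image timing sketch of this kind might look as follows; the `detect` callable and image list are illustrative:

```python
# Minimal per-image timing sketch.
import time

def average_time(detect, images):
    start = time.perf_counter()
    for img in images:
        detect(img)
    per_image = (time.perf_counter() - start) / len(images)
    return per_image, 1.0 / per_image  # seconds per image, images per second
```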
All models are tested on a held-out test set that was
not seen during training, preventing data leakage
and ensuring the validity and reliability of
the results. The diverse range of deepfake types,
facial features, and environmental conditions present
in this test set enables a comprehensive evaluation of
model generalization capabilities. Moreover,
stratified cross-validation is used to reduce the impact
of dataset splitting on the evaluation results.
4 RESULTS AND EVALUATION
4.1 Quantitative Evaluation
Table 2 shows the overall performance of both the
MTCNN and YuNet face detection methods with the
InceptionResNetV1 model for image binary
classification of deepfakes. The evaluation computed
accuracy, precision, recall, and F1-score. Comparing
the performance of both pipelines on these
metrics, the YuNet-based pipeline outperformed the
MTCNN-based pipeline, as seen in Table 2.
Specifically, YuNet has an accuracy of 57.2%, higher
than the 52.2% accuracy achieved by MTCNN, thus
showcasing its overall better efficacy in classifying
the images correctly. In addition, YuNet provides
greater precision (55.0% vs. 51.6%), indicating a
lower false positive ratio. YuNet is also higher in
recall (82.8% vs. 81.1%) and F1-score (66.1% vs.
63.1%), indicating a consistent improvement.
Table 2: Overall Performance (Source: Author).

Metric    | MTCNN | YuNet
Accuracy  | 0.522 | 0.572
Precision | 0.516 | 0.550
Recall    | 0.811 | 0.828
F1-Score  | 0.631 | 0.661
4.2 Detailed Statistics for Fake Images
Table 3 presents a statistical analysis of MTCNN
and YuNet performance metrics when processing the
fake image subset. The results reveal notable
differences between these face detection algorithms
in deepfake detection tasks. YuNet demonstrates a
significant computational advantage, processing fake
images at 0.008 seconds per image—approximately
three times faster than MTCNN's 0.024 seconds. This
efficiency difference has important implications for
real-time applications and resource-constrained
deployment scenarios.
In terms of detection capability, YuNet has a
slightly higher detection rate (82.8%) than MTCNN
(81.1%), which indicates that YuNet can perform
more robustly while identifying facial areas among
the manipulated content. However, MTCNN shows a
slightly higher mean confidence value (0.802 vs
0.766), suggesting that while it detects fewer faces
overall, it is more confident in the faces it does
detect.
The Area Under the Curve (AUC) score shows
that YuNet (0.624) outperforms MTCNN (0.544) by
8 percentage points. This substantial improvement in
classification performance indicates that the YuNet-
based pipeline possesses superior capability in
distinguishing authentic from manipulated facial
content across various threshold settings. YuNet
therefore offers a more cost-effective solution,
enabling real-time detection on limited hardware and
on mobile platforms such as NVIDIA Jetson boards
or smartphones. This is attributed
to YuNet's higher processing speed, which allows for
faster analysis of images and videos, crucial for real-
time applications. Its lightweight architecture and
efficient computation also make it suitable for
deployment in environments with limited
computational resources.
Table 3: Detailed Statistics for MTCNN and YuNet for the Fake Image Subset (Source: Author).

Statistic                       | MTCNN | YuNet
Average Processing Time (s)     | 0.024 | 0.008
Detection Rate (%)              | 81.1  | 82.8
Average Detection Confidence    | 0.802 | 0.766
Total Fake Images Processed     | 5492  | 5492
Fake Images with Faces Detected | 4453  | 4548
AUC Score                       | 0.544 | 0.624
4.3 Graphical Analysis
4.3.1 Processing Time Comparison
A comparative box plot of the processing times for
MTCNN and YuNet when detecting faces in fake
images is shown in Figure 2. The average processing
time is 0.008 seconds per image for YuNet versus
0.024 seconds per image for MTCNN. With this
threefold speedup, YuNet is efficient enough for use
cases such as real-time deepfake detection and the
analysis of large image datasets.
Figure 2: Processing Time Comparison for Fake Images
Using MTCNN and YuNet.
4.3.2 Face Detection Rate
In Figure 3, the comparison of detection rate is
shown, i.e., how well each algorithm manages to
detect at least one face in the fake images. MTCNN
achieves a detection rate of 81.1%, while YuNet
holds a small but meaningful margin at 82.8%. The
data-driven nature of YuNet's architecture highlights
its improved capability in detecting facial features in
manipulated content, which may allow for more
nuanced differentiation of facial features than is
possible with MTCNN. While the difference in
detection rates
may seem minor, it can have a significant impact on
the overall performance of the deepfake detection
system.
Figure 3: Face Detection Rate for Fake Images Using
MTCNN and YuNet.
4.3.3 Detection Confidence Distribution
Figure 4 shows confidence score distributions for the
two algorithms on fake images when detecting faces.
Interestingly, although MTCNN has a higher average
confidence (0.802 vs 0.766 for YuNet), its overall
detection performance is not better. This observation
indicates that confidence scores must be interpreted
carefully, as more confident models do not
necessarily classify deepfakes better.
Figure 4: Detection Confidence Distribution for Fake
Images Using MTCNN and YuNet.
4.3.4 ROC Curve Analysis
The ROC curves for both detection systems are
presented in Figure 5. The results further show that
while MTCNN achieves an AUC of 0.544, YuNet
obtains an AUC of 0.624. This sizable margin in
discriminative ability suggests that YuNet achieves
more stable classification across different threshold
conditions, improving the overall separation of
genuine and forged faces.
Figure 5: ROC Curves for Deepfake Detection Using MTCNN and YuNet.
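For reference, an ROC curve of this kind can be produced with scikit-learn and matplotlib; the labels and scores below are illustrative placeholders, not the study's data:

```python
# Illustrative ROC-curve plot (assumes scikit-learn and matplotlib).
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.5, 0.9, 0.6, 0.3]

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_true, y_score):.3f}")
plt.plot([0, 1], [0, 1], "k--", label="chance (AUC = 0.5)")
plt.xlabel("False positive rate (fall-out)")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()
```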
4.3.5 Confusion Matrices
Figure 6: Confusion Matrices for MTCNN and YuNet.
Confusion matrices are a standard tool for analyzing
the results of binary classification tasks. Figure 6
presents the confusion matrices generated for both
models, providing a visual representation of true
positives (TP), false positives (FP), true negatives
(TN), and false negatives (FN). The confusion matrix
analysis revealed a key difference in the models'
performance. For MTCNN, the counts were: 1243
true negatives, 4170 false positives, 1039 false
negatives, and 4453 true positives. For YuNet, the
counts were: 1688 true negatives, 3725 false
positives, 944 false negatives, and 4548 true
positives. Notably, YuNet produced significantly
fewer false negatives (944) compared to MTCNN
(1039). This indicates that YuNet is more sensitive in
detecting manipulated images, meaning it is less
likely to miss a true manipulated instance. This
superior sensitivity is crucial in security-focused
applications where the cost of failing to detect a
deepfake (a false negative) can be substantial.
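As a consistency check, the headline metrics in Table 2 can be recomputed directly from these confusion-matrix counts with a few lines of plain Python:

```python
# Recomputing the Table 2 metrics from the Figure 6 confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

print("MTCNN:", metrics(tp=4453, fp=4170, tn=1243, fn=1039))
# -> accuracy 0.522, precision 0.516, recall 0.811, F1 0.631
print("YuNet:", metrics(tp=4548, fp=3725, tn=1688, fn=944))
# -> accuracy 0.572, precision 0.550, recall 0.828, F1 0.661
```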
5 DISCUSSION
The comparative analysis of MTCNN and YuNet for
deepfake detection reveals several significant insights
with important implications for real-world
applications. The YuNet-based pipeline consistently
outperforms the MTCNN-based approach across
multiple performance metrics, establishing it as the
superior choice for deepfake detection systems.
The YuNet-based pipeline outperforms the
MTCNN-based pipeline in terms of accuracy,
precision, recall, and F1-score. The improved
performance is due to YuNet's stronger face
detection, which achieves a higher success rate at
locating faces even in manipulated imagery. This higher detection
rate means that deepfakes can be identified more
accurately, which is important in any security or
authentication application.
The YuNet-based pipeline also processes images
notably faster than the MTCNN-based one. The efficiency
advantage can be explained by the architectural
design of YuNet, optimized especially for low-
latency face detection operations. This threefold
acceleration in processing time is a significant
advantage for any real-time applications where
computational efficiency is a critical factor, including
live video analysis or high-throughput image
processing systems. Moreover, the YuNet-based
pipeline achieves a greater AUC score than the
MTCNN-based pipeline, emphasizing its enhanced
performance in separating the real and manipulated
facial images. This improved discriminative ability is
additionally confirmed via confusion matrix analysis,
which indicates how YuNet yields fewer false
negatives and thus exhibits a higher capability of
accurately recognizing deepfake images. Even
though MTCNN has a slightly higher mean detection
confidence, its detection rate is lower, which shows
that high confidence scores do not necessarily equal
better classification performance. This observation
emphasizes the need to evaluate deepfake detection
systems using complementary metrics rather than
confidence scores alone.
The performance metrics of this comparative
study show the YuNet-based deepfake detection
pipeline to be better than the MTCNN-based
approach in terms of accuracy, processing time, and
robustness. The demonstrated benefits of the YuNet-based
pipeline indicate that it is a feasible option for
practical scenarios in deepfake detection, proving to
be more reliable and more efficient in tackling the
issue of synthetic media manipulation.
6 CONCLUSIONS
The comparative study on MTCNN and YuNet face
detection systems with InceptionResNetV1 as the
core recognition system for deepfake detection
indicates significant advantages of the YuNet-based
pipeline. The experimental results show that YuNet
outperforms MTCNN in terms of all three important
performance factors: detection accuracy, processing
speed, and robustness to different input conditions.
The above benefits are directly attributable to the
improved face detection portion of YuNet and the
architectural modifications made to minimize
latency, which render it especially suitable for low-
latency applications like deepfake detection.
This study lays the groundwork for multiple
intriguing avenues of future work. One promising
direction is using transfer learning to adapt
pre-trained InceptionResNetV1 models by fine-tuning
them on domain-specific datasets that more accurately
reflect the changing landscape of synthetic media.
While some recent works have focused solely on
retraining detectors on new data through transfer
learning, others have incorporated architectural
innovations from contemporary state-of-the-art
neural networks, potentially benefiting detection
performance on progressively more advanced
deepfake content that leaves only subtle facial artifacts.
Another valuable direction for future work is robust
data augmentation. Methods such as geometric
transformations, noise injection, and adversarial
training can broaden model generalization to cover a
multitude of deepfake generation techniques. Such
techniques would allow detection systems to remain
effective even as deepfake technologies grow in
complexity and subtlety. Ensemble methodologies
also deserve thorough study. Ensemble detection
models that combine experts specialized in particular
manipulation types or artifact domains could pave the
way toward more complete and generalized detection
systems. Ensemble techniques can include
combinations of convolutional neural networks with
transformer architectures, as well as joint spatial and
temporal analysis for video deepfakes, further
enhancing the detection of synthetic media that often
requires multiple complementary signals.
Beyond facial detection alone, future research
should explore multimodal approaches that consider
both the visual and audio features of media in order
to detect discrepancies typical of deepfakes. In
addition, effective detection pipelines can be made
lighter and deployed on edge devices for broader
application of deepfake detection technologies.
As state-of-the-art models for generating synthetic
media continue to grow more sophisticated and
easier to acquire and use, the potential effects of
deepfakes on different aspects of society are
alarming.
This arms race between generation and detection
technology requires that detection methodologies
continue to evolve. This study offers important
contributions to this vital domain by extending the
knowledge of effective strategies for sustaining
digital media authenticity in a world filled with ever-
more persuasive synthetic media.
These results argue for continued development of
more effective deepfake detection systems and for
sustained research investment in this area to maintain
the integrity of information across the digital
ecosystem. Future
interdisciplinary collaboration between computer
vision specialists, security researchers, and media
forensics experts will be critical to create holistic
solutions to these growing threats.
REFERENCES
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi,
“Inception-v4, Inception-ResNet and the Impact of
Residual Connections on Learning,” Feb. 12, 2017,
Association for the Advancement of Artificial
Intelligence. https://doi.org/10.1609/aaai.v31i1.11231
E. Wahab, W. Shafique, H. Amir, S. Javed, and M. Marouf,
“Robust face detection and identification under
occlusion using MTCNN and ResNet50,” Jan. 19,
2025. https://doi.org/10.30537/sjet.v7i2.1499.
G. Gupta, K. Raja, M. Gupta, T. Jan, S. T. Whiteside, and
M. Prasad, “A comprehensive review of deepfake
detection using advanced machine learning and fusion
methods,” Electronics, vol. 13, no. 1, p. 95, Dec. 2023.
https://doi.org/10.3390/electronics13010095.
J. Newman, “AI-generated fakes launch a software arms
race,” NiemanLab, Dec. 2018. [Online]. Available:
https://www.niemanlab.org/2018/12/ai-generated-fakes-launch-a-software-arms-race
(Accessed: Feb. 20, 2025).
J. Du, "High-precision portrait classification based on
MTCNN and its application on similarity judgement,"
J. Phys.: Conf. Ser., vol. 1518, no. 1, p. 012066, 2020,
doi: 10.1088/1742-6596/1518/1/012066.
K. Sudarshana and M. C. Mylarareddy, "Recent Trends in
Deepfake Detection," in Deep Natural Language
Processing and AI Applications for Industry 5.0, 2021,
pp. 1-28. DOI: 10.4018/978-1-7998-7728-8.ch001.
L. Chadha, H. Kulasrestha, V. Bhargava, and V. Jindal,
“Improvised approach to deepfake detection,” 2023.
M. Taeb and H. Chi, “Comparison of Deepfake Detection
Techniques through Deep Learning,” Mar. 04, 2022,
Multidisciplinary Digital Publishing Institute. doi:
10.3390/jcp2010007.
M. L. Saini, A. Patnaik, Mahadev, D. C. Sati, and R.
Kumar, "Deepfake Detection System Using Deep
Neural Networks," in Proceedings of the 2024 2nd
International Conference on Computer, Communication
and Control (IC4), Indore, India, Feb. 8–10, 2024.
DOI: 10.1109/IC457434.2024.10486659.
S. Lyu, “Deepfake detection: Current challenges and next
steps,” in Proc. IEEE Int. Conf. Multimedia Expo
Workshops (ICMEW), Jun. 2020.
https://doi.org/10.1109/ICMEW46912.2020.9105991.
S. A. Khan and D. Dang‐Nguyen, “Deepfake Detection: A
Comparative Analysis,” Jan. 01, 2023, Cornell
University. https://doi.org/10.48550/arxiv.2308.03471.
S. Dhesi, L. Fontes, P. Machado, I. K. Ihianle, F. F. Tash,
and D. A. Adama, “Mitigating adversarial attacks in
deepfake detection: An exploration of perturbation and
AI techniques,” Cornell University, Jan. 1, 2023.
https://doi.org/10.48550/arXiv.2302.11704.
T. T. Nguyen, C. M. Nguyen, D. T. Nguyen, and S.
Nahavandi, “Deep learning for deepfakes creation and
detection,” Cornell University, Sep. 25, 2019. [Online].
Available: http://arxiv.org/pdf/1909.11573.pdf
(Accessed: Jan. 2025).
T. N. Le, H. H. Nguyen, J. Yamagishi, and I. Echizen,
“OpenForensics: Large-scale challenging dataset for
multi-face forgery detection and segmentation
in-the-wild,” Oct. 2021.
https://doi.org/10.1109/iccv48922.2021.00996.
T. Kularkar, T. Jikar, V. Rewaskar, K. Dhawale, A.
Thomas, and M. Madankar, “Deepfake Detection Using
LSTM and ResNext,” 2024.
https://www.ijcrt.org/papers/IJCRT2311476.pdf.
Tran The Vinh, Nguyen Thi Khanh Tien, and Tran Kim
Thanh, “A survey on deep learning-based face
detection,” Applied Aspects of Information Technology,
vol. 6, no. 2, pp. 201–217, 2023.
https://doi.org/10.15276/aait.06.2023.15.
V. L. L. Thing, "Deepfake Detection with Deep Learning:
Convolutional Neural Networks versus Transformers,"
2023. [Online]. Available:
https://arxiv.org/pdf/2304.03698.
V. S. Barpha, R. Bagrecha, S. Mishra, and S. Gupta,
“Enhancing deepfake detection: Leveraging MTCNN
and Inception ResNet V1,” May 2024.
W. Wu, H. Peng, and S. Yu, “YuNet: A tiny
millisecond-level face detector,” Apr. 2023.
https://doi.org/10.1007/s11633-023-1423-y.
X. Cao and N. Z. Gong, "Understanding the Security of
Deepfake Detection," arXiv:2107.02045, 2021.
[Online]. Available:
https://doi.org/10.48550/arXiv.2107.02045.
Xinyooo, “Deepfake detection,” Jan. 2020. [Online].
Available: https://github.com/xinyooo/deepfake-detection
(Accessed: Feb. 24, 2025).
Y. Chai, J. Liu, and Y. Li, "Facial target detection and
keypoints location study using MTCNN model," J.
Phys.: Conf. Ser., vol. 2010, no. 1, p. 012097, 2021, doi:
10.1088/1742-6596/2010/1/012097.
Y. Gautham, R. Sindhu, and J. Jenitta, “Review on
detection of deepfake in images and videos,” Jul. 26,
2024. https://doi.org/10.5120/ijca2024923825