REAL-TIME FACE SWAPPING IN VIDEO SEQUENCES
Magic Mirror
Nuri Murat Arar¹, Fatma Güney¹, Nasuh Kaan Bekmezci¹, Hua Gao² and Hazım Kemal Ekenel¹,²,³
¹Department of Computer Engineering, Bogazici University, Istanbul, Turkey
²Institute for Anthropomatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
³Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey
Keywords:
Face Swapping, Face Blending, Face Replacement, Face Detection, Active Appearance Model, Face Tracking.
Abstract:
Magic Mirror is a face swapping tool that replaces the user's face with the face of a famous person selected from a database. The system provides a user interface from which the user can select one of the listed celebrities. Upon selection, model fitting starts automatically, using the results of face detection and facial feature localization. Model fitting yields a set of points that describe the estimated shape of the user's face. Using this shape information, the user's face is replaced with the selected celebrity's face. After some post-processing for color and lighting adjustments, the final output is displayed to the user. The proposed system runs in real-time and generates satisfactory face swapping results, which can be applied to face de-identification in videos or other entertainment applications.
1 INTRODUCTION
Advances in image analysis have given rise to applications of face processing techniques in several areas. Face replacement studies have gained importance due to the increasing publicity of personal photographs. Moreover, studies on face swapping and morphing produce some of the most intriguing systems in facial image analysis. In this study, we present such a system for fully-automatic face swapping in video sequences. The system performs face swapping on images acquired by a video capturing device, which is why we call it Magic Mirror. The user interacts with the system via a user interface, through which the user is presented with a list of celebrities and asked to select the face for swapping. A snapshot of the user interface and a sample output of the system can be seen in Fig. 1.
Magic Mirror can be used as a tool to amuse people by swapping their faces with the faces of famous people from politics, sports, and entertainment, such as movie stars and popular musicians, or simply by swapping friends' faces with each other. It is possible to integrate Magic Mirror into popular chat applications as a plug-in option for video chats. Magic Mirror might also be used for personal security purposes, such as identity protection and face de-identification, against the increasing availability of visual media.
Figure 1: Example snapshots showing how the system operates via its user interface.
There are a few studies regarding both face swapping and face de-identification. Existing studies, (Bitouk et al., 2008) and (Blanz et al., 2004), present techniques for face swapping on individual images. The main contribution of our study is the efficient use and effective combination of available methods for
face detection, facial feature localization, face modeling, and face tracking in order to realize a fully automatic, real-time face replacement system. We achieved plausible performance despite the continuity and short execution time constraints imposed by real-time video processing. We used the Active Appearance Model (AAM) (Cootes et al., 2001) to model the face; therefore, our study does not constrain the similarity of pose angles as in (Bitouk et al., 2008). In (Blanz et al., 2004), the performance of the system strictly depends on manual initialization, and 3D modeling of faces raises more complicated implementation issues. Our study, on the other hand, presents a fully automatic way of swapping faces successfully using a simpler approach. Additionally, the work in (Gross et al., 2006) achieves face de-identification using AAMs in still images. The main difference of that work from ours is that they use the AAM to produce different appearances by changing model parameters to de-identify faces, whereas we use the AAM to obtain the face contour in different images, then align and swap the faces for de-identification.
The rest of the paper is organized as follows: Section 2 introduces related work, Section 3 gives a detailed description of the proposed method, and experimental results and discussions are presented in Section 4. Finally, Section 5 concludes the paper.
2 RELATED WORK
There are both 2D and 3D studies related to face swapping, face replacement, and face de-identification.
The work in (Blanz et al., 2004) presents a system that exchanges faces in images across large differences in viewpoint and illumination. It allows the user to replace faces in images for a hair-style try-on. They fit a morphable 3D model to both the input and the target face images by estimating shape, pose, and the direction of the illumination. The 3D face reconstructed from the input image is rendered using the pose and illumination parameters obtained from the target image. This system requires manual initialization for accurate alignment between the model and the face images.
The work in (Gross et al., 2006) introduces a framework for de-identifying facial images. The majority of the privacy protection schemes currently used in practice rely on ad-hoc methods such as pixelation or blurring of the face; they instead combine a model-based face image parameterization with a formal privacy protection model. They change the identifying points on the face according to the protection model, utilizing an AAM to model the faces and to find the identity information in them.
The method proposed in (Bitouk et al., 2008) automatically replaces faces in images. They construct a large library of face images extracted from images collected from the internet using face detection software and aligned to a common coordinate system. Their replacement is composed of three stages. First, they detect a face in the input image and align it to the coordinate system in order to find candidate faces that are similar to the input face in terms of pose and appearance. Then, they adjust the pose, lighting, and color properties of the candidate face images according to the input image, and perform swapping with the input image. Finally, they rank the resulting face replacements according to a match distance and display the top-ranked ones as results.
3 METHODOLOGY
3.1 System Overview
We propose a face swapping mechanism for video sequences that combines known computer vision techniques, such as face detection and AAM, with a face alignment procedure. We use modified census transform (MCT) feature-based face detection (Küblbeck and Ernst, 2006) and an Active Appearance Model (AAM) to locate and model the face and its attributes. For the face alignment procedure, we use piecewise-affine warping. The basic steps of our face swapping approach are shown in Fig. 2. There are two independent execution mechanisms: an offline and an online process. The offline process is a one-time procedure to build the AAM used during the online process. The AAM is built from a set of landmarked faces, and this model is loaded each time the real-time system is started.
Figure 2: System overview.
The online process is executed each time the system starts; its internal steps, including face detection and tracking, AAM fitting, face alignment, face swapping, and post-processing, are performed sequentially. That is, the results of the detection step are used to construct the initial shape required by the AAM fitting step, and alignment through warping produces the coordinates used in swapping. The online process starts with a welcome screen in which the user is asked to select a face from the list of celebrity faces. The face chosen by the user is called the target face. Face detection and AAM fitting are then performed on the target face to obtain its best-fitted shape. Next, video frames, which we call input frames, are captured from a video camera, and a similar procedure is applied to them: face detection is performed on each input frame to locate the face, the detection results are fed into the AAM fitting step, and an estimated face shape is obtained. After the face detection and AAM fitting steps, we have the best-fitted shapes of both the input face and the target face; however, for a successful swapping, they must be at the same scale on a common coordinate system. We therefore warp the target shape to the input shape for exact alignment. We swap the two faces by exchanging the pixel intensities of the aligned target face with the corresponding intensities of the input face, using the shape information of both faces. After swapping, in order to enhance the realism of the output images, we apply post-processing that computes the output image as a weighted combination of the input and target faces for color and lighting adjustment. Finally, the output image is displayed to the user. Fig. 3 shows an example of an input image, a target image, and the corresponding output image.
Figure 3: Input image (left), target image (middle) and output image (right).
3.2 Face Detection using MCT
We use the MCT frontal face detector, which relies on inherently illumination-invariant image features for object detection, to locate the face in video frames. The features convey only structural object information and are computed from a 3 × 3 pixel neighborhood using the modified census transform. For classification, this feature set is used with a four-stage classifier cascade. Each stage classifier is a linear classifier consisting of a set of lookup tables of feature weights. Detection is carried out by scanning all possible analysis windows of a particular size, and each window is classified as either background or face. The detector has four classifiers of increasing complexity; each stage can either reject the current analysis window or pass it to the next stage, according to the outcome of a thresholding operation. As a result of the four-stage cascade, the detector determines the location and scale of the face in the image. Using the same algorithm, we can train detectors for individual facial components such as the eyes and mouth. An example of the detection output is displayed in Fig. 4.
Figure 4: Outputs of face detection step: bounding face
rectangle, mouth and eyes.
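To make the feature computation concrete, the following is a minimal Python/numpy sketch of the modified census transform itself. It covers only the per-pixel 9-bit feature; the cascade training and window scanning of (Küblbeck and Ernst, 2006) are omitted, and the function name is ours.

import numpy as np

def modified_census_transform(gray):
    # Compute the 9-bit MCT index for every interior pixel of a
    # grayscale image. Each bit compares one pixel of the 3x3
    # neighborhood against the neighborhood mean.
    h, w = gray.shape
    g = gray.astype(np.float32)
    # Mean of the 3x3 neighborhood around every interior pixel,
    # built from the nine shifted views of the image.
    mean = sum(g[y:y + h - 2, x:x + w - 2]
               for y in range(3) for x in range(3)) / 9.0
    mct = np.zeros((h - 2, w - 2), dtype=np.uint16)
    bit = 0
    for y in range(3):
        for x in range(3):
            neighbor = g[y:y + h - 2, x:x + w - 2]
            mct |= (neighbor > mean).astype(np.uint16) << bit
            bit += 1
    return mct

Since every bit is a comparison against the local mean, the resulting index is unchanged under gain and offset changes of illumination, which is what makes the features inherently illumination invariant.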
3.3 Active Appearance Model
The AAM is a deformable model that defines a statistical model of the shape and texture of a known object, together with a fitting algorithm. Many face-related studies use AAMs to model the shape of the face. The shape is defined by a set of points on the common coordinate system of the training set. First, a mean shape and its modes of variation are learned from the training data by applying PCA. Then, by perturbing the mean shape along these modes of variation, various shape instances can be obtained. The texture information is a statistical model of the gray-level image, captured by sampling with a suitable image warping function defined by the shape parameters.
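In the standard formulation of (Cootes et al., 2001), a shape instance s and a texture instance g are generated from the learned means and PCA modes as

s = \bar{s} + P_s b_s,    g = \bar{g} + P_g b_g,

where \bar{s} and \bar{g} are the mean shape and mean texture, P_s and P_g hold the eigenvectors retained by PCA, and b_s and b_g are the shape and texture parameter vectors that are adjusted during fitting.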
3.3.1 AAM Building
The AAM building process is an offline procedure performed only once, before the system starts to operate. We used the IMM face database (Nordström et al., 2004), containing 240 annotated monocular images of 40 different human faces, to build the model. Each image in the database is annotated with 58 landmarks defining the shape of the face (Fig. 5). The AAM is trained with these annotated images.
Figure 5: Sample original image from the IMM face database (left) and annotated image (right).
3.3.2 AAM Fitting
AAM fitting is used to obtain the shapes of both the target and the input faces. It is an iterative process of adjusting the model parameters to fit the model to unseen face images (Cootes et al., 2001). An initial shape for the detected face is constructed using the facial features detected in the first step. First, an affine transform is estimated from the facial features on the input image and the model shape. Then, the estimated transformation is applied to the model shape so that the differences between the locations of the feature points on the model shape and those on the input image are minimized (Gao, 2008). The result of AAM fitting is an optimized model shape which best approximates the face in the current image (Fig. 6). This optimization is performed using the simultaneous inverse compositional algorithm described in (Gross et al., 2005).
Figure 6: AAM fitting results of the input image (left) and the target image (right).
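The paper does not spell out how the initial shape is assembled from the detections, so the following Python sketch reflects our assumptions: the two eye centers and the mouth center serve as the three anchor points, three point pairs determine an affine transform exactly, and all names are illustrative.

import numpy as np
import cv2

def initial_shape(mean_shape, model_anchors, detected_anchors):
    # mean_shape:       (58, 2) model points in model coordinates
    # model_anchors:    (3, 2) eye and mouth centers in the same
    #                   model coordinates
    # detected_anchors: (3, 2) corresponding points from the MCT
    #                   eye/mouth detectors on the current frame
    A = cv2.getAffineTransform(model_anchors.astype(np.float32),
                               detected_anchors.astype(np.float32))
    # Apply the 2x3 affine matrix to every model point.
    pts = np.hstack([mean_shape, np.ones((len(mean_shape), 1))])
    return pts @ A.T

With more than three anchor points, one would solve a least-squares problem instead of calling cv2.getAffineTransform.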
3.3.3 Tracking with AAM
Face detection is the most expensive step in terms of computation time. Since this application aims at real-time face swapping, using a tracker instead of performing face detection on each frame saves a considerable amount of time. Based on the fact that the face in the current frame cannot be too far away from the face found in the previous frame, a tracking mechanism using the AAM can be developed. AAM fitting is performed on each frame by using the fitting results, i.e., the shape and appearance parameters of the previous frame, as the initialization for the current frame (Gao, 2008). The face is located using the face detector in the first frame; after that, the best-fit shape of the previous frame is directly fed to the AAM as the initial shape for the current frame. An error measure that yields a low value for a good face model fit and a high value for a poor fit is defined to indicate the quality of AAM fitting. Whenever the error is high, it indicates that tracking or fitting has failed and a reinitialization is required to continue tracking in the following frames. Reinitialization is carried out by detecting the face again and feeding the initial shape constructed from it to the AAM fitting step; the overall loop is sketched below.
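A compact sketch of this detect-track-reinitialize loop, written in Python with an assumed aam/detector interface (the authors' actual API is not published), could look as follows.

def track(frames, detector, aam, error_threshold):
    # Yields one fitted shape per frame; `aam.fit` is assumed to
    # return the optimized shape and a fitting error. Interface
    # and threshold are illustrative.
    shape = None
    for frame in frames:
        if shape is None:
            # (Re)initialization: run the expensive detector and
            # build the initial shape from the detected features.
            features = detector.detect(frame)
            shape = aam.initial_shape(features)
        shape, error = aam.fit(frame, init_shape=shape)
        if error > error_threshold:
            # Fitting failed; force a fresh detection next frame.
            shape = None
            continue
        yield shape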
3.4 Face Alignment
Faces registered in the system are extracted at various sizes and orientations. Input frames coming from the video camera also differ in size and orientation, because the user does not stand in front of the camera at a constant distance and position. In order to swap the input and target faces directly, an alignment step is necessary to bring face images of various sizes and positions into a common coordinate system at a particular size. For the alignment process, we first bring the RGB face of the chosen celebrity into this coordinate system, using its best-fit shape and its original image, to obtain the aligned target face. Then we bring the RGB face in the input frame into the same coordinate system to obtain the aligned input face. To do this, a piecewise-affine warping over a triangulated mesh is applied, and pixels inside each triangle are determined by bilinear interpolation. At this point, the faces are aligned in terms of position, but their sizes still differ, and their shapes differ in width, height, and the locations of points. Finally, we warp the aligned target face to the aligned input face. This warping resizes the target face in proportion to the input face and warps its points to achieve the alignment (Fig. 7).
Figure 7: Target image aligned to a common coordinate system (left), aligned input image (middle), and target image aligned to the aligned input image (right).
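A piecewise-affine warp over a triangulated mesh can be sketched with OpenCV as below. Warping the full image once per triangle keeps the sketch short; an efficient implementation would warp only each triangle's bounding rectangle. The function interface is our assumption.

import numpy as np
import cv2

def piecewise_affine_warp(src_img, src_pts, dst_pts, triangles, dst_size):
    # Warp src_img so that landmarks src_pts move to dst_pts.
    # `triangles` holds index triples from a triangulation of the
    # landmarks, shared by both shapes; dst_size is (width, height).
    out = np.zeros((dst_size[1], dst_size[0], 3), dtype=src_img.dtype)
    for tri in triangles:
        s = src_pts[tri].astype(np.float32)
        d = dst_pts[tri].astype(np.float32)
        # Affine map taking the source triangle onto the destination.
        M = cv2.getAffineTransform(s, d)
        # warpAffine interpolates bilinearly by default.
        warped = cv2.warpAffine(src_img, M, dst_size)
        # Keep only the pixels inside the destination triangle.
        mask = np.zeros(out.shape[:2], dtype=np.uint8)
        cv2.fillConvexPoly(mask, d.astype(np.int32), 1)
        out[mask == 1] = warped[mask == 1]
    return out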
3.5 Face Swapping
After the alignment process, the size, orientation, and shape of the input face and the target face correspond in the common coordinate system. At this point, the relative locations of the faces in the images are known, but to complete the swapping process, the transformation between these locations needs to be calculated. To calculate this transformation, the bounding boxes of the two shapes are first located. Then, the translation that matches the upper-left corner of the bounding box of the target shape with the upper-left corner of the bounding box of the input shape is computed. The transformation is carried out by applying this translation to all pixels of the target face. Lastly, the corresponding pixel intensity values are copied to the output image. The results of the face swapping step are shown in Fig. 8.
Figure 8: Input image (left), target image (middle) and face
swapping result (right).
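Under this description, the transformation reduces to a translation between the bounding-box corners of the two shapes. A minimal sketch, with illustrative names and with bounds checking omitted for brevity:

import numpy as np

def swap(input_img, aligned_target, input_shape, target_shape, face_mask):
    # Shapes are (N, 2) arrays of (x, y) landmarks; face_mask marks
    # the pixels inside the aligned target face contour.
    dx, dy = (input_shape.min(axis=0)
              - target_shape.min(axis=0)).astype(int)
    out = input_img.copy()
    ys, xs = np.nonzero(face_mask)  # pixels inside the target face
    # Copy target intensities to the translated positions.
    out[ys + dy, xs + dx] = aligned_target[ys, xs]
    return out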
3.6 Face Blending as Post Processing
After the face swapping process, face blending is performed in order to smooth the transitions along the face contour. Face blending is the process of combining the intensities of the input and target faces with weights that vary with the pixel coordinates: on the face contour, the weight of the input face is higher, and the weight of the target face increases toward the center of the output face. Fig. 9 depicts the effectiveness of this post-processing step.
Figure 9: Face swapping without post processing (left) and
face swapping with post processing (right).
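The paper specifies the weighting only qualitatively. One plausible realization, sketched here without claiming it matches the authors' exact weights, derives the blending weight from each pixel's distance to the face contour:

import numpy as np
import cv2

def blend(input_face, target_face, face_mask):
    # Distance of each inside-pixel to the face contour; zero on
    # and outside the contour.
    dist = cv2.distanceTransform(face_mask.astype(np.uint8),
                                 cv2.DIST_L2, 3)
    # Weight of the target face: 0 at the contour, 1 at the center.
    w = (dist / (dist.max() + 1e-6))[..., None]
    return (w * target_face
            + (1.0 - w) * input_face).astype(input_face.dtype)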
4 RESULTS
Some qualitative results produced by the system are shown in Figs. 10 and 11. The results were obtained by selecting frames from the real-time output video of the system. Fig. 12 shows frames taken from a 6-second video stream produced by the system in order to demonstrate the robustness of the face alignment to the scale, rotation, and translation changes occurring during the video. Please note that it is difficult to assess the performance of the developed system quantitatively, since doing so requires a significant amount of facial feature point annotation. In fact, the evaluation of face replacement results is mostly qualitative by the problem's nature. A quantitative evaluation is employed in (Bitouk et al., 2008) in the form of a user study testing people's ability to distinguish between real face images and those generated by their system. However, this evaluation strategy cannot be applied here, since we deal with video sequences rather than individual images and require the user to select the target image beforehand.
Figure 10: Some results of the system: target images (left)
and output images (right).
The system runs in real-time, processing faces with an average spatial resolution of 110 × 110 pixels in 640 × 480 frames. To obtain quantitative results and to evaluate the real-time behavior of the system, we measured the average execution time of each internal step. The results of the experiments, conducted on an Intel Core i5 2.5 GHz computer, are shown in Table 1.
Table 1: Average execution times of internal steps.
Internal Step              Average Execution Time (ms)
Detection                  173.22
Tracking                   17.12
AAM fitting                0.05
Alignment                  21.66
Swapping and Blending      2.91
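Assuming the steps run sequentially and detection is skipped while tracking succeeds, a tracked frame costs roughly 17.12 + 0.05 + 21.66 + 2.91 ≈ 41.7 ms, i.e., about 24 frames per second, whereas a frame that triggers re-detection costs roughly 197.8 ms, i.e., about 5 frames per second. The re-initialization frequency discussed below therefore dominates the effective frame rate.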
Face detection is the most time-consuming step of the system. With the tracking mechanism, face detection is not performed in each frame. However, there is a trade-off between response time and realistic face swapping. Lower values of the error measure on model fitting cause the tracker to be re-initialized more frequently, which yields better face model fitting results. Each re-initialization means a new face detection; therefore, the average processed frame rate decreases. The average processed frame rate can be increased by allowing higher values of the error measure, which results in poorer fitting of faces. In our demo sessions, we observed that poor fitting is a more noticeable deficiency than low response time. Especially when lighting in the environment is insufficient, the fitting error increases and the tracker becomes useless. In such cases, even continuous detection instead of tracking produces more realistic results.
Figure 11: Example output sequence; each row shows three screenshots belonging to a different celebrity selection. For each frame in a row, note the differences in position, angle, and size.
Figure 12: Sample frames from a 6-second output video.
5 CONCLUSIONS
In this study, we developed a complete and fully-automatic system that swaps the user's face with a face selected by the user via the program's user interface. The images are acquired by a video capturing device, and the face swapping process is applied to these images. The system has a simple and user-friendly interface from which the user selects a celebrity from a list. After the celebrity is selected, the system performs face swapping between the user and the celebrity without requiring any manual initialization. We used the MCT face detector to locate frontal faces and MCT eye and mouth detectors to localize the facial features. The detection results are fed to the AAM fitting mechanism as the initial shape positions. After the AAM fitting process, the best fits for the input and target faces are generated. Using these shapes, a piecewise-affine warping over a triangulated mesh is applied, in which the target shape is warped onto the input shape. Face swapping is performed by obtaining a transformation between the aligned target face and the input face. Finally, we apply post-processing that blends the target and input faces using a weighted combination of intensity values. The system produces qualitatively convincing results and runs in real-time, processing frames of 640 × 480 spatial resolution.
REFERENCES
Bitouk, D., Kumar, N., Dhillon, S., Belhumeur, P. N., and
Nayar, S. K. (2008). Face swapping: Automatically
replacing faces in photographs. ACM Transactions on
Graphics, 27(3):1–8.
Blanz, V., Scherbaum, K., Vetter, T., and Seidel, H.-P.
(2004). Exchanging faces in images. Computer
Graphics Forum, 23(3):669–676.
Cootes, T., Edwards, G., and Taylor, C. (2001). Active ap-
pearance models. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 23(6):681–685.
Gao, H. (2008). Face registration with active appearance models for local appearance-based face recognition. Master thesis, Universität Karlsruhe.
Gross, R., Matthews, I., and Baker, S. (2005). Generic vs.
person specific active appearance models. Image Vi-
sion Computing, 23(11):1080–1093.
Gross, R., Sweeney, L., Torre, F., and Baker, S. (2006).
Model-based face de-identification. 22nd IEEE In-
ternational Conference on Data Engineering, (ICDE),
27(3):161–168.
Küblbeck, C. and Ernst, A. (2006). Face detection and tracking in video sequences using the modified census transformation. Image Vision Computing, 24(6):564–572.
Nordström, M. M., Larsen, M., Sierakowski, J., and Stegmann, M. B. (2004). The IMM face database - an annotated dataset of 240 face images. Technical report, Technical University of Denmark.