Optimal Camera Placement for 6D Head Pose Estimation
Harshita Soni, Nikhil Tirumala and Aratrik Chattopadhyay
Mercedes Benz R&D India, Bengaluru, India
Keywords:
Optimal Camera Placement, Visibility, Binary Integer Programming.
Abstract:
Multi-view systems for 6D head pose estimation have applications in human-computer interaction (HCI), virtual reality, 3D reconstruction, etc. In a multi-view system, visibility of facial landmarks is essential for accurately regressing 2D landmarks, which are then triangulated to obtain 3D fiducials. From these 3D fiducials, the 6D head pose is mathematically derived. Optimal camera placement (OCP) is therefore vital for achieving precise pose estimation. OCP can be formulated as a constrained optimization problem that can be solved using Binary Integer Programming. We redefine two key aspects: the visibility criteria and the camera search space. Our visibility algorithm employs a parametric head model to track fiducials, achieving more precise results than the ground truth of the CMU (Carnegie Mellon University) Panoptic dataset. Additionally, we geometrically optimize the camera search space, deviating from the baseline of uniformly arranged cameras. Through rigorous experimentation, we show that this refined search space not only reduces execution time but also improves the optimality of the solution, giving 99.9% visibility coverage. We also introduce a heuristic method that reduces the constraint-building time from 27 seconds to just 0.07 seconds per control point, while maintaining concise solutions with minimal effect on visibility metrics.
1 INTRODUCTION
6D head pose estimation includes three degrees of freedom for rotation (yaw, pitch, roll) and three for translation (X, Y, Z). Multi-view systems for 6D head pose estimation have various applications across different fields. In AR/VR, head pose estimation enhances the immersive experience by ensuring virtual objects align correctly with the user's perspective. Security applications are equipped with multi-camera surveillance systems where head pose estimation can be used to analyze the behavior and intentions of individuals in surveillance footage. Additionally, such systems can be used for ground-truth data collection, as in the public CMU Panoptic dataset (Joo et al., 2016), whose captured data is used to train deep learning models.
The 6D head pose is mathematically derived from the 3D landmarks of a set of critical points on the human face, called fiducials, such as the eye corners, mouth corners, nose tip, etc. Fiducials form the backbone of most downstream functionalities of such a multi-view system. Fiducials are triangulated from their 2D landmarks in multiple camera views. 2D facial landmark regression from an image is a well-known problem that has been addressed using both traditional computer vision techniques and deep learning methods. To summarize, the precision of an end-to-end multi-view system depends on the availability of a certain minimum number of clear image views for the 2D landmark regressor to regress on confidently. Hence, an optimal placement of cameras is essential for the seamless functioning of a multi-view system.
The concept of optimal camera placement (OCP) originates from the art gallery problem posed by (Chvátal, 1975), which aims to determine the minimum number of guards (or cameras) needed to cover an entire area.
OCP for surveillance applications does not require tracking objects in 3D. In the case of OCP for 3D objects, the object in most use cases is static and in a fixed, known position and orientation. In this work, we solve OCP for an application where the target undergoes continuous changes in 6D pose. We also have the stringent requirement that the selected fiducials of the object be visible in at least two cameras for all 6D poses. This makes OCP for a 3D object in motion a more challenging task.
Another aspect that complicates the visibility
computation in our case is that the human head is a
complex manifold. The visibility of the fiducials is
sensitive to small changes in head position and orientation, leading to self-occlusion.
OCP can be solved using two main approaches: an iterative method, which is not always optimal, and binary integer programming (BIP), which guarantees an optimal solution.

Finding an optimal solution using BIP is an NP-hard problem. Optimizing any of the inputs to BIP reduces computation time and allows a solution to be reached quickly. This is overlooked in most previous works, which merely discretize the search space into uniform grids. This work shows that initializing the search space geometrically leads to reduced computation time and an optimal solution. We also show that our geometric approach makes the solution generic, so it can be adapted to any larger 3D environment.
The end-to-end pipeline could be computationally intensive due to framing and solving the constraints. In this work, we propose an alternate way to precompute and approximate the visibility to save time during execution. To summarize, the contributions of our paper are:

1. We present an algorithm to check the visibility of fiducials and model self-occlusion for a human head.

2. We propose an initialization technique for the camera search space that drastically reduces the search space and execution time and improves the optimality of the solution.

3. We propose an approach to reduce the execution time by approximating the visibility.

4. We show that camera solutions obtained from our method can be flexibly fitted to any larger, arbitrarily shaped 3D environment.

5. We show the superiority of our method over the baseline on multiple test metrics.
The paper reviews related work in section 2. Section 3 covers the problem statement, including the visibility model (3.1), OCP formulation (3.2), and camera search space optimization (3.3). Visibility approximation is discussed in section 3.4. Methodologies are evaluated in section 4, with experiments detailed in section 5. The paper concludes with key findings in section 6.
2 RELATED WORK
There has been a lot of work in the domain of optimal camera placement (OCP) since its origin in the art gallery problem (O'Rourke, 1993).

(Hörster and Lienhart, 2006) presented optimal camera configuration as an integer linear programming problem, which can incorporate different constraints and cost functions pertinent to a particular application. They approximate the continuous camera space by sampling the positions and poses. The visibility model is an essential part of OCP, and its definition changes with the application. (Hörster and Lienhart, 2006) and (Bettahar et al., 2014) defined visibility as the field of view of a camera, taken to be a 2D fixed-size triangle, while the visibility in (Zhao et al., 2008) is based on the projected length of the tag in the image plane. (Puligandla and Lončarić, 2022) determine the visibility of a control point in a camera by checking its presence within the five planes of the FoV pyramid of the camera. The visible point analysis technique of (Zhang et al., 2021) is based on a Hidden Point Removal (HPR) approach. Most of these visibility models are applicable for tracking objects or tags, and some have been simplified to 2D. Given our use case of 6D head pose estimation, tracking of 3D facial landmarks cannot be done by 2D-based projection methods. In contrast, a 3D-based method like HPR is applicable but susceptible to misclassification errors around regions of high local curvature.

(Zhao et al., 2008), (Liu et al., 2014), (Bettahar et al., 2014) and (Puligandla and Lončarić, 2022) uniformly divide the camera configuration space into grids and populate the camera search space by placing cameras in those grids. (Zhang et al., 2015) adopts the technique of local optimization of a single camera, followed by iterative addition of cameras to capture the uncovered surfaces, which does not ensure optimality of the overall solution. (Zhang et al., 2021) employs a genetic algorithm for global optimization of camera configurations.
3 OCP FOR MULTI-VIEW SYSTEM
3.1 Visibility Model
We use an off-the-shelf 3D head parametric model that defines a human head and neck with N_v vertices and N_f faces or triangles. The head model is used to model the visibility of a fiducial in a camera. All the cameras are assumed to be pinhole. Let P = {P_1, P_2, ..., P_{N_v}} be the set of vertices, and F = {F_1, F_2, ..., F_{N_f}} be the set of triangles in the head mesh. Fiducial points (N_K of them) are critical points subsetted from P.
Figure 1: Visibility of a fiducial in a camera. The left eye inner corner is visible in the camera, while the right eye inner corner is obstructed by the nose.
Let K = {k_1, k_2, ..., k_{N_K}} be the indices of the fiducial points in P. Each triangle F_i is constructed from 3 vertices, given by vertex_to_face(P_{i_1}, P_{i_2}, P_{i_3}) where i_1, i_2, i_3 ∈ [1..N_v]. Let C_j be a camera in the multi-view system. A 3D point P_k is said to be visible in C_j if

1. P_k is in the field-of-view (FoV) of C_j, and
2. P_k is not occluded by any of the triangles in F when viewed in C_j.

We model the self-occlusion of fiducial P_k as an aggregation of the intersections of the ray C_jP_k with all the triangles of the mesh. We check the intersections using the concept of barycentric coordinates.
The self-occlusion of a fiducial depends on the 6D pose of the head. A 6D pose Q_i consists of 3 degrees of freedom for rotation R (yaw, pitch, roll) and 3 degrees of freedom for translation T (X, Y, Z). The FLAME mesh F is obtained by transforming a neutral-posed mesh N by a 6D pose Q_i.
Let Intersection(P_k, F_i, C_j) be a binary flag representing the intersection status of the ray C_jP_k with triangle F_i. Self-occlusion is then modeled by aggregating these flags over all triangles that do not contain P_k:

\[
selfoccl(P_k, Q_i, C_j) = \bigwedge_{i=1}^{N_f} \lnot\, Intersection(P_k, F_i, C_j),
\]
\[
F_i \in \{\, vertex\_to\_face(P_{i_1}, P_{i_2}, P_{i_3}) \in F \;\; \text{s.t.} \;\; i_1 \neq k,\; i_2 \neq k,\; i_3 \neq k \,\}
\qquad (1)
\]
The visibility of a fiducial P_k on a head transformed by Q_i, as seen from camera C_j, is given by

\[
Vis(P_k, Q_i, C_j) = selfoccl(P_k, Q_i, C_j) \wedge FoV(F, C_j)
\qquad (2)
\]

where FoV(F, C_j) checks whether the mesh F, projected onto camera C_j, lies within its image plane.
Figure 2: FoV overlap.
3.2 Optimal Camera Placement
Let C denote the camera search space of size N_C, where C_j ∈ C is characterized by the 3D spatial location (X_{C_j}, Y_{C_j}, Z_{C_j}) and orientation (yaw_{C_j}, pitch_{C_j}) of the camera C_j. We define the control space Q as the set of all head movements we want to capture from multiple views, and eventually constrain the optimization problem on it. Q_i ∈ Q is composed of (X_{Q_i}, Y_{Q_i}, Z_{Q_i}, yaw_{Q_i}, pitch_{Q_i}, roll_{Q_i}), representing a head movement. X is the horizontal axis, Y the vertical axis and Z the depth axis. Control points are the fiducial points in the neutral FLAME mesh transformed by the control space. A control point can be indexed by the 6D pose of the head Q_i and fiducial P_k.
Given the above formulation, the objective is to minimize the number of cameras in the multi-view solution such that each fiducial point is visible in at least 2 cameras. Hence, the camera placement problem can be formulated as

\[
\text{Minimize} \;\; \sum_{j=1}^{N_C} b_{C_j} \quad \text{s.t.}
\qquad (3)
\]
\[
\sum_{j=1}^{N_C} b_{C_j} \, Vis(P_k, Q_i, C_j) \geq 2, \quad i = 1, 2, \ldots, N_Q
\qquad (4)
\]
\[
\sum_{C_j \,\text{at}\, X,Y,Z} b_{C_j} \leq 1
\qquad (5)
\]

where b_{C_j} is a binary variable indicating the membership of camera C_j in the solution. The first constraint, eq. 4, is defined for each of the N_K fiducial landmarks. In addition to the visibility constraints, spatial constraints are imposed to avoid placing two cameras (in different orientations) at the same 3D spatial location. The objective is constrained on a total of N_Q · N_K (visibility) + |Unique((X_{C_j}, Y_{C_j}, Z_{C_j}) : C_j ∈ C)| (spatial) constraints. This optimization problem can be solved by binary integer programming (BIP).
Figure 3: Derivation of step size of yaw in 2D.
We show empirically in section 3.3 that the problem need not be constrained on FoV if C is defined as proposed.
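As a concrete illustration of eqs. 3–5, the sketch below formulates the BIP with the PuLP interface to the CBC solver (the solver used in our experiments, section 5.2). The variable names and the precomputed visibility array are illustrative, not part of the released implementation.

```python
import pulp

def solve_ocp(vis, cam_locations, min_views=2):
    """vis[i][k][j] = 1 if fiducial k under pose Q_i is visible in camera C_j;
    cam_locations[j] = (X, Y, Z) of candidate camera C_j (several orientations
    may share one 3D location)."""
    n_q, n_k, n_c = len(vis), len(vis[0]), len(vis[0][0])
    prob = pulp.LpProblem("optimal_camera_placement", pulp.LpMinimize)
    b = [pulp.LpVariable(f"b_{j}", cat="Binary") for j in range(n_c)]

    # Objective (eq. 3): minimise the number of selected cameras.
    prob += pulp.lpSum(b)

    # Visibility constraints (eq. 4): every control point seen by >= 2 cameras.
    for i in range(n_q):
        for k in range(n_k):
            prob += pulp.lpSum(vis[i][k][j] * b[j] for j in range(n_c)) >= min_views

    # Spatial constraints (eq. 5): at most one camera per 3D location.
    by_location = {}
    for j, loc in enumerate(cam_locations):
        by_location.setdefault(loc, []).append(j)
    for same_spot in by_location.values():
        prob += pulp.lpSum(b[j] for j in same_spot) <= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [j for j in range(n_c) if b[j].value() > 0.5]
```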
3.3 Optimization of Camera Search Space
To geometrically optimize C, we solve for an arrangement of cameras in which adjacent cameras share a minimum degree of overlap between their coverage. This criterion is driven by the need for a minimum of 2 views for triangulation. We show empirically in the following sections that spherical placement is the optimal arrangement.
We assume all the candidate cameras are static perspective cameras with the same horizontal and vertical FoV. The camera lens has a minimum and maximum working distance within which the scene can be captured with adequate focus and resolution. The scene captured by any one camera is the frustum of a quadrilateral pyramid.
The cameras can be placed on the surface of a 3D sphere constructed around the centroid of the control space Q, denoted by centroid. The pose (X_{C_j}, Y_{C_j}, Z_{C_j}, yaw_{C_j}, pitch_{C_j}) of C_j is worked out from the sphere's radius R, the field of view of C_j, and the coverage overlap desired between adjacent cameras. Given the geometric structure of the frustum, ensuring that the head is covered at the nearest depth (Z_{Q_i}) within the control space automatically guarantees coverage at Z > Z_{Q_i}. The critical region (Cr) is the width of the volume that needs to be covered by a camera and depends on the use case. Let us consider an x% overlap in the critical region between adjacent cameras. This means the overlap at greater depths within the control space will be > x%. Regarding the amount of overlap, as shown in fig. 2, C_1 and C_2 can cover the fiducial points on the left side of the head, while C_2 and C_3 can cover the right. A 50% overlap would have sufficed if the head were static. However, to ensure complete coverage of the fiducial points throughout the control space, we opt for an overlap greater than 50%. This approach guarantees ample options for the BIP to select from in the camera search space. We determine the radius R based on the largest sphere that can fit within the 3D environment. A camera search space C derived from Cr and R evades the need to explicitly constrain OCP on FoV. Considering all these factors, we derive the pose_step in one dimension at which the cameras should be positioned on the sphere; the same method can be applied to calculate the steps in the other dimensions as well. The pose_step (yaw_step, pitch_step) determines yaw_{C_j} and pitch_{C_j} of the cameras, and X_{C_j}, Y_{C_j}, Z_{C_j} can then be indexed as a point on the sphere using yaw_{C_j} and pitch_{C_j}. The algorithm for building the spherical camera search space from the calculated yaw_step and pitch_step is given in Algorithm 2.

The coverage overlap computation at the critical region of the control space is done in 2D for simplification. A sample derivation of the pose_step required for radius R and x% overlap on Cr in 2D is shown in fig. 3.
In fig. 3, we solve for ∠C_1OC_2. Here, the critical region (Cr) that must always be covered by C_1 is highlighted by AB. The minimum working distance (W_dist) of camera C_1 is given by C_1E and can be derived from Cr and the fov of C_1. In 2D, the coverage overlap between the cameras C_1 and C_2 is represented by BD. The step size ∠C_1OC_2 is solved for by setting BD to x% of AB, which boils down to

\[
\angle C_1OC_2 = 90^\circ - \frac{fov}{2}
- \cos^{-1}\!\left(\frac{R \cos\frac{fov}{2}}{\sqrt{\left(\frac{(x-50)\,Cr}{100}\right)^2 + \left(R - W_{dist}\right)^2}}\right)
- \tan^{-1}\!\left(\frac{(x-50)\,Cr}{\left(R - W_{dist}\right) \cdot 100}\right)
\qquad (6)
\]

where W_dist is given as

\[
W_{dist} = \frac{Cr}{2} \cdot \tan\!\left(90^\circ - \frac{fov}{2}\right)
\qquad (7)
\]
Note that the critical region Cr, the radius R and the fov are coupled to each other and are not free variables. As Cr increases, R may have to be increased to accommodate the more exhaustive coverage.
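For concreteness, the step size can be computed directly from eqs. 6 and 7. The short sketch below, with illustrative parameter names, reproduces the values reported in section 5.2 for the two configurations used in our experiments.

```python
import math

def pose_step(R, Cr, fov_deg, overlap_pct):
    """Angular step between adjacent cameras on the sphere, eqs. (6) and (7)."""
    fov = math.radians(fov_deg)
    w_dist = (Cr / 2.0) * math.tan(math.pi / 2 - fov / 2)          # eq. (7)
    a = (overlap_pct - 50.0) * Cr / 100.0                          # overlap beyond 50% in the critical region
    step = (math.pi / 2 - fov / 2
            - math.acos(R * math.cos(fov / 2) / math.hypot(a, R - w_dist))
            - math.atan2(a, R - w_dist))                           # eq. (6)
    return math.degrees(step)

print(pose_step(R=500, Cr=200, fov_deg=90, overlap_pct=60))   # ~14.11 deg
print(pose_step(R=800, Cr=400, fov_deg=90, overlap_pct=75))   # ~13.96 deg
```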
Figure 4: Camera Search Space: Uniform vs Spherical. To cover the same region, the uniform placement requires more cameras than the spherical placement.
3.3.1 Why a Spherical Placement?
When creating the camera search space for an arbitrarily shaped target 3D environment, we may intuitively place the cameras all over the surface, as shown by the outer polygon in fig. 4, encapsulating the region of interest. Naively placing cameras on the surface, or dividing the possible camera configuration space into uniformly spaced grids, may lead to redundancy of cameras in the visibility space and may not ensure sufficient camera options for the BIP.

The bigger the camera search space, the longer the computation time for building the visibility constraints. However, a bigger camera space does not necessarily ensure an optimal solution. Our geometric placement of candidate cameras, derived from the visibility requirements, ensures a relatively more optimal solution than the uniform placement.
3.3.2 Generalizability to a Bigger 3D Environment
The camera solution derived from spherically placed cameras is reusable: it can easily adapt to any larger arbitrary 3D environment. A camera can be moved along the line connecting the camera to the centroid, which is also the camera's optical axis in this scenario. The camera's depth can be scaled up spatially (the camera can be pushed back) from its original 3D location as long as the face is captured at the minimum resolution needed for good performance of the 2D landmark detector. The concept is demonstrated qualitatively with some examples, and a mathematical proof is given in the appendix (eq. 13). We present the following reasons to justify that moving the cameras along their optical axis ensures maximum reusability:

1. The optical axes of all the cameras in the solution intersect at the centroid, which is also the center of the sphere. When the camera is scaled along the optical axis:
Figure 5: Compensation of visibility on up-scaling of cameras.
(a) There is no change in the visibility of a head located at the centroid. Only the size of the head in the image changes, and hence the visibility of the fiducials is not affected.

(b) When the head is not at the centroid, the visibility is affected, but less than by translating the camera in any other direction. In fig. 6, camera C_1, a part of the camera solution, can see points P_1, P_2 and P_3. Pushing C_1 to C_1' preserves the visibility of all 3 points. Pushing C_1 to C_2 favors the visibility of P_1 and P_2, but loses P_3. In addition to losing the visibility of P_3, such translations of cameras have a cascading effect on the remaining cameras in the solution. Computing equivalent transformations/translations of the remaining cameras to compensate for the loss of visibility of fiducials is non-trivial; it is as bad as solving the binary integer program all over again. A generalizable method, like ours, must preserve the constraints satisfied by the original camera network as much as possible.
2. Moreover, a single camera's coverage might be affected by the upscaling along the optical axis. However, the combined coverage of the original solution from all the cameras is maintained in the upscaled solution. For example, in fig. 5, the left eye inner corner was visible in C_1, which upon scaling to C_1' is no longer able to capture it due to obstruction by the nasal bridge. On the other hand, the same fiducial was originally out of the FoV of C_2, and is now captured by C_2'.
3.4 Approximate Visibility
Building visibility constraints for a Q_i requires computing Vis(·, Q_i, C_j) with respect to all C_j in C, which is time-consuming and computationally intensive. We propose that visibility need not be computed for points at all depths and from all cameras.
Figure 6: Visibility preservation is maximal for a translation along the camera's line of sight.
The visibility flags of the fiducial points can be precomputed over a wide range of head positions and orientations from one camera at one depth (Z_fixed), and looked up while building visibility constraints with respect to the other cameras on the fly.

Visibility is unaffected when P_k' = α · P_k, i.e., when a fiducial P_k is moved along the camera's line of sight by a scale factor α. If any other point P_l can be expressed as a scaled version of P_k, the visibility of P_l will be the same as that of P_k. Readers may refer to the appendix (eq. 13) for the proof.

Every 3D point can be represented as a scaled version of a point at a fixed depth Z_fixed. Hence, the visibility of control points need not be computed at all depths, as long as the visibility flag of a scaled version of P_k at Z_fixed is available. If an exact match is unavailable in the visibility lookup table LUT_vis, the visibility flag of the nearest control point is used. This approach can be used in cases where a certain degree of approximation is acceptable as a trade-off for runtime computation. To reduce the approximation, LUT_vis must be populated with a finer granularity and broader ranges of head positions (X, Y) and orientations (yaw, pitch and roll).

LUT_vis is populated with visibility flags for the N_K fiducials at (|yaw ranges| × |pitch ranges| × |roll ranges|) head orientations and (|X ranges| × |Y ranges|) head positions, computed at Z = Z_fixed. A step-by-step algorithm for obtaining the (approximate) visibility of P_k from C_j given LUT_vis is shown in Algorithm 1.
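The precomputation itself can be organised as a dense grid indexed by the discretised head pose; the sketch below shows one possible layout, assuming the visibility routine of section 3.1 is available as a Python function `visibility(pose, camera)` returning one flag per fiducial. The grid ranges and the reference camera are placeholders.

```python
import numpy as np
from itertools import product

def build_visibility_lut(visibility, fiducials, ref_camera, z_fixed,
                         yaw_rng, pitch_rng, roll_rng, x_rng, y_rng):
    """Precompute visibility flags for every fiducial over a grid of head
    orientations and (X, Y) positions at a single depth Z = z_fixed, as seen
    from a single reference camera."""
    lut = np.zeros((len(yaw_rng), len(pitch_rng), len(roll_rng),
                    len(x_rng), len(y_rng), len(fiducials)), dtype=bool)
    for (iy, yaw), (ip, pitch), (ir, roll), (ix, x), (iv, y) in product(
            enumerate(yaw_rng), enumerate(pitch_rng), enumerate(roll_rng),
            enumerate(x_rng), enumerate(y_rng)):
        pose = (yaw, pitch, roll, x, y, z_fixed)      # 6D head pose at the LUT depth
        lut[iy, ip, ir, ix, iv] = visibility(pose, ref_camera)
    return lut
```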
4 EVALUATION
To evaluate the effectiveness of our approach, we define the following metrics:

1. Conciseness (N_S): The fewer cameras in the solution, the more concise the multi-view solution.

2. Test Visibility Metric (η): The control points that fail to be captured by two or more cameras are categorized as Failure. Test visibility is defined as

\[
\eta = 1 - \frac{Failure}{N_Q^{test}}
\qquad (8)
\]

where N_Q^{test} is the size of the test control space.

3. Constraint Building Time (T_vis): The time complexity for building the visibility constraints is O(N_Q · N_C · N_K), as there are N_Q visibility constraints for each of the N_K fiducial points, and building each constraint requires a visibility computation from each of the N_C cameras.

4. Camera Exposure Rate (β): We define the camera exposure rate of a fiducial point as the average number of cameras in the multi-view solution that can capture the fiducial point.
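The two data-dependent metrics can be read directly off a per-control-point view count; a small sketch, assuming `views[i]` holds the number of solution cameras that capture test control point i:

```python
import numpy as np

def test_visibility_metric(views, min_views=2):
    """eta (eq. 8): fraction of test control points seen by >= 2 cameras."""
    views = np.asarray(views)
    failures = int((views < min_views).sum())
    return 1.0 - failures / views.size, failures

def camera_exposure_rate(views):
    """beta: average number of solution cameras capturing a control point."""
    return float(np.mean(views))
```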
Algorithm 1: Proposed approximate visibility computation of P_k from camera C_j.

Input:
  Visibility lookup LUT_vis[yaw ranges, pitch ranges, roll ranges, X ranges, Y ranges]
  Control point P_k: yaw_p, pitch_p, roll_p, X_p, Y_p, Z_p
  Camera C_j: yaw_c, pitch_c, X_c, Y_c, Z_c
  Depth Z_fixed
  Rotation matrix to Euler angles: euler()
  Euler angles to rotation matrix: rot_mtx()
Output: Approximate visibility of P_k w.r.t. C_j

  t_p = [X_p, Y_p, Z_p]
  t_c = [X_c, Y_c, Z_c]
  R_p = rot_mtx(yaw_p, pitch_p, roll_p)
  R_c = rot_mtx(yaw_c, pitch_c, 0)
  R_pc = R_p · R_c^T
  yaw_pc, pitch_pc, roll_pc = euler(R_pc)
  t_pc = (t_p − t_c) · R_c
  X_pc, Y_pc, Z_pc = t_pc
  X_pc = (Z_fixed / Z_pc) · X_pc
  Y_pc = (Z_fixed / Z_pc) · Y_pc
  yaw* = argmin_{y_i ∈ yaw ranges} |y_i − yaw_pc|
  pitch* = argmin_{p_i ∈ pitch ranges} |p_i − pitch_pc|
  roll* = argmin_{r_i ∈ roll ranges} |r_i − roll_pc|
  X* = argmin_{X_i ∈ X ranges} |X_i − X_pc|
  Y* = argmin_{Y_i ∈ Y ranges} |Y_i − Y_pc|
  vis = LUT_vis[yaw*][pitch*][roll*][X*][Y*]
  return vis
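A compact Python rendering of Algorithm 1 is sketched below; the Euler-angle convention used by `rot_mtx()`/`euler()` is an assumption and must match the one used when LUT_vis was built.

```python
import numpy as np

def approx_visibility(lut, ranges, point, cam, z_fixed, rot_mtx, euler):
    """Approximate Vis(P_k, Q_i, C_j) by a nearest-bin lookup (Algorithm 1).

    ranges = (yaw_rng, pitch_rng, roll_rng, x_rng, y_rng); rot_mtx()/euler()
    convert between Euler angles and rotation matrices (convention is assumed
    to match the LUT). point = (yaw, pitch, roll, X, Y, Z); cam = (yaw, pitch, X, Y, Z).
    """
    yaw_p, pitch_p, roll_p, *t_p = point
    yaw_c, pitch_c, *t_c = cam
    R_c = rot_mtx(yaw_c, pitch_c, 0.0)
    R_pc = rot_mtx(yaw_p, pitch_p, roll_p) @ R_c.T
    yaw, pitch, roll = euler(R_pc)                     # head pose relative to the camera
    x, y, z = (np.asarray(t_p) - np.asarray(t_c)) @ R_c
    x, y = (z_fixed / z) * x, (z_fixed / z) * y        # rescale to the LUT depth
    idx = [int(np.argmin(np.abs(np.asarray(rng) - val)))
           for rng, val in zip(ranges, (yaw, pitch, roll, x, y))]
    return lut[tuple(idx)]
```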
The mathematical proof showing the correctness of the proposed approximate visibility flags derived from precomputed visibility is given in the appendix.
Table 1: Performance of experiments with different configurations of camera search space.

Configuration                                        | Radius (mm) | N_C  | N_S | η     | Failure | β  | T_vis (s/iter)
Baseline: Uniform Placement of Cameras               | -           | 5416 | 15  | 0.854 | 111529  | 3  | 133
Our Method: Default                                  | 500         | 325  | 31  | 0.998 | 1113    | 11 | 12
Our Method: Diagonal (20°, 20°)                      | 500         | 975  | 20  | 0.998 | 1320    | 8  | 27
Our Method: Translational (100)                      | 500         | 975  | 19  | 0.998 | 839     | 8  | 27
Our Method: Diagonal & Translational (20°, 20°, 100) | 500         | 1625 | 18  | 0.998 | 1247    | 7  | 40
Our Method: Default                                  | 800         | 325  | 10  | 0.999 | 477     | 5  | 12
Our Method: Diagonal (20°, 20°)                      | 800         | 975  | 9   | 0.999 | 195     | 5  | 27
Our Method: Translational (200)                      | 800         | 975  | 9   | 0.999 | 296     | 5  | 27
Our Method: Diagonal & Translational (20°, 20°, 200) | 800         | 1625 | 9   | 0.999 | 309     | 5  | 40
Figure 7: Example images from the CMU Panoptic Dataset. The first row shows visible fiducials according to the dataset. The second row shows the visible fiducials according to our algorithm. Non-visible (incorrect) fiducials are colored red.
5 EXPERIMENTS
5.1 Visibility Algorithm
The head mesh model FLAME (Li et al., 2017) is used to model the self-occlusion of fiducials. It has N_v = 5023 vertices and N_f = 9976 triangles. We show the precision of the visibility algorithm on a public dataset, the CMU Panoptic Dataset (Joo et al., 2016), a multi-view dataset captured in a 3D environment with 31 HD cameras. It provides 3D facial landmarks and corresponding visibility flags from all the cameras. We compare the ground-truth visibility flags of the following fiducials from the dataset with the visibility computed by our method: left eye inner corner, left eye outer corner, right eye inner corner, right eye outer corner, left mouth corner, right mouth corner, nose tip, chin. Fig. 7 shows the qualitative results.
5.2 Spherical Placement of Candidate Cameras
Coordinate System: In our experiments, we assume the X-axis points to the right, the Y-axis downwards, and the Z-axis into the screen.

We solve the problem using a small control space of size N_Q = 768 and test the solution on a 3× more finely sampled control space of size N_Q^test ≈ 96000. The 6D ranges of the runtime control space and the test control space are defined in Table 2. As for fiducial points, we constrain OCP on the visibility of all the fiducials mentioned in section 5.1. We do not add FoV constraints (eq. 4) as they are implicitly satisfied by all the candidate cameras. All experiments use the open-source CBC solver (Forrest et al., 2024).

Baseline: Assuming the scene is set in a rectangular room, we imitate the baseline by placing 5416 cameras in uniform grids on the front and side walls, with the person's head directed towards the front wall.
Our Method: The camera search space is created as described in section 3.3. In eq. 6, the horizontal and vertical fov is set to 90°, hence yaw_step = pitch_step. We present results for two radii: 500mm and 800mm. For the experiments with R = 500mm, Cr is set to 200mm with an overlap of x = 60%, resulting in a pose_step of 14.11°. For R = 800mm, Cr is set to 400mm with an overlap of x = 75%, yielding a pose_step of 13.96°. In both cases, the step sizes are rounded to 15°. As stated earlier, Cr, R and fov are interrelated, implying that the geometry in fig. 3 may change if the variables are geometrically inconsistent. For instance, targeting a Cr of 400mm with a radius of 500mm will modify the geometry of fig. 3 and in turn eq. 6. Experiments with the optimized search space are performed in 4 different augmentation settings:
1. Default Configuration: Placing the cameras over the sphere at steps of 15° yaw and pitch.

2. Diagonal Augmentation (aug_yaw, aug_pitch):
   (a) Default Configuration
   (b) For each default camera, 2 additional cameras oriented at
       (yaw_{C_j} + aug_yaw, pitch_{C_j} + aug_pitch) and
       (yaw_{C_j} − aug_yaw, pitch_{C_j} − aug_pitch)

3. Translational Augmentation (aug_trans):
   (a) Default Configuration
   (b) For each default camera, 2 additional cameras translated to
       X_{C_j} + aug_trans and X_{C_j} − aug_trans

4. Diagonal & Translational Augmentation (aug_yaw, aug_pitch, aug_trans):
   (a) Default Configuration
   (b) For each default camera, 2 additional cameras rotated and translated to
       (yaw_{C_j} + aug_yaw, pitch_{C_j} + aug_pitch, X_{C_j} + aug_trans) and
       (yaw_{C_j} − aug_yaw, pitch_{C_j} − aug_pitch, Y_{C_j} − aug_trans)
In Setting 2, the default configuration is augmented by orienting cameras diagonally at specific angles from their original poses. In Setting 3, the default setup is augmented with cameras translated along a specified dimension. These augmentations aim to provide the BIP with more options, if needed. The results from these augmented settings closely resemble those of the default configuration, highlighting the effectiveness of our approach. The augmentations represent a trade-off between the conciseness of the solution and the computational complexity of the search space, allowing users to choose based on their specific use case.
Our extensive experiments demonstrate that the multi-view solution achieved through optimized initialization of the camera search space outperforms the solution obtained from uniform camera placement across all test metrics. The results in Table 1 indicate that our method, with or without augmentation, consistently achieves a high test visibility score (over 99%) while using a much smaller camera search space than the baseline.

A more concise camera search space also leads to significantly reduced execution times compared to the baseline. A low test visibility score indicates a poor-quality solution, as it comes with a higher failure rate at intermediate control points. Although the results of some of our experiments are not as compact as the baseline solution, such as the ones with R = 500mm, these solutions are still more desirable due to their higher η, better β, and lower T_vis.

Our best outcome, featuring a Diagonal (20°, 20°) configuration with R = 800mm, is visualized in fig. 8. This camera solution offers a 14.5% increase in η with 6 fewer cameras than the baseline solution. On average, our method achieves a β of 5 or higher, compared to the baseline score of 3.
Table 2: Range of runtime and test control space (yaw, pitch, roll in degrees; X, Y, Z in mm).

Dimension | min  | max  | runtime step | test step
yaw       | -90  | 60   | 30           | 10
pitch     | -60  | 30   | 30           | 10
roll      | -60  | 30   | 30           | 10
X         | 0    | 200  | 200          | 50
Y         | -400 | -200 | 200          | 50
Z         | 300  | 500  | 200          | 50
5.3 Generalizability of Camera Solution
To simulate fitting to an arbitrarily shaped 3D environment, we upscale the cameras from the existing solution by random factors. As seen in Table 3, the count of test control points failing to be tracked by the environment-adapted solution (Failure_adapted) is less than or equal to the failures of the original solution (Failure_original).

The experimental results in Tables 1 and 3 highlight the optimality of the spherical placement strategy and the adaptability of the solutions, respectively.
Table 3: Performance of environment-adapted camera solution vs original solution.

Failure_original | Scale Range | Failure_adapted
1113             | 1.5 - 2.0   | 0
1247             | 1.0 - 1.5   | 121
195              | 1.2 - 1.6   | 61
5.4 Camera Placement from Proposed Approximated Visibility
To test the proposed approximated visibility explained in section 3.4, we experiment with two LUT_vis of different granularities. Both lookups are populated with visibility flags of the required fiducials over the ranges yaw ∈ [−180°, 180°], pitch ∈ [−90°, 90°] and roll ∈ [−180°, 180°] at steps of 10°, and X and Y in the range [−700, 500]mm, much bigger than the runtime Q (Table 2). The visibility for the above range of poses is calculated at Z_fixed = 500mm. The lookup C-50 has visibility flags stored at a spatial granularity (X and Y) of 50mm, and C-25 at 25mm. As can be noticed from the results in Table 4, both solutions include a much smaller number of cameras but more invisible test control points than the original solution. The C-50 configuration exhibits greater sparsity in the X and Y dimensions than the C-25 configuration, resulting in the solver using less precise visibility flags than those used in the C-25 configuration.
Figure 8: A sample camera solution from our method a) Top View, b) Front View, c) Side View.
Computationally, the lookup creation can be treated as preprocessing and utilized for multiple integer programs constrained on various definitions of Q. The trade-off between the performance of a multi-view system and its conciseness depends highly on the nature of the application.
6 CONCLUSION
This work revisits OCP for 6D head pose estimation, where we propose an optimized initialization of the camera search space and redefine the visibility of 3D points. Furthermore, as an alternative to the computationally intensive task of calculating visibility for all fiducials from every camera, we introduce an algorithm for approximate visibility computation. As future work, the solution can be extended to be derived from a candidate set of varied focal lengths and PTZ cameras.
Table 4: Performance with Approximated Visibility.

Lookup Configuration | η     | N_S | T_vis (s/iter)
Original             | 0.998 | 18  | 27
C-25                 | 0.878 | 10  | 0.07
C-50                 | 0.716 | 10  | 0.07
REFERENCES
Bettahar, H., Morsly, Y., and Djouadi, M. S. (2014). Optimal camera placement based resolution requirements for surveillance applications. In 2014 11th International Conference on Informatics in Control, Automation and Robotics (ICINCO), volume 01, pages 252–258.

Chvátal, V. (1975). A combinatorial theorem in plane geometry. Journal of Combinatorial Theory, Series B, 18(1):39–41.

Forrest, J., Ralphs, T., Vigerske, S., Santos, H. G., Hafer, L., Kristjansson, B., et al. (2024). coin-or/cbc: Release releases/2.10.12.

Hörster, E. and Lienhart, R. (2006). On the optimal placement of multiple visual sensors. In Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, VSSN '06, pages 111–120, New York, NY, USA. Association for Computing Machinery.

Joo, H., Simon, T., Li, X., Liu, H., Tan, L., Gui, L., Banerjee, S., Godisart, T., Nabbe, B. C., Matthews, I. A., Kanade, T., Nobuhara, S., and Sheikh, Y. (2016). Panoptic studio: A massively multiview system for social interaction capture. CoRR, abs/1612.03153.

Li, T., Bolkart, T., Black, M. J., Li, H., and Romero, J. (2017). Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph., 36(6).

Liu, J., Sridharan, S., Fookes, C., and Wark, T. (2014). Optimal camera planning under versatile user constraints in multi-camera image processing systems. IEEE Transactions on Image Processing, 23(1):171–184.

O'Rourke, J. (1993). On the rectilinear art gallery problem. Computational Geometry, 3(1):53–58.

Puligandla, V. A. and Lončarić, S. (2022). A multiresolution approach for large real-world camera placement optimization problems. IEEE Access, 10:61601–61616.

Zhang, H., Eastwood, J., Isa, M., Sims-Waterhouse, D., Leach, R., and Piano, S. (2021). Optimisation of camera positions for optical coordinate measurement based on visible point analysis. Precision Engineering, 67:178–188.

Zhang, X., Chen, X., Alarcon-Herrera, J. L., and Fang, Y. (2015). 3-D model-based multi-camera deployment: A recursive convex optimization approach. IEEE/ASME Transactions on Mechatronics, 20(6):3157–3169.

Zhao, J., Cheung, S.-C., and Nguyen, T. (2008). Optimal camera network configurations for visual tagging. IEEE Journal of Selected Topics in Signal Processing, 2(4):464–479.
APPENDIX
Translating the head without any change in rotation changes the view of the object in the camera, affecting the visibility of the fiducial points. Similarly, rotating the object without translating it also affects the visibility of the fiducials. However, there is a special case of translation in which the visibility is unaffected: when the object is translated along the camera's line of sight. Given an image I of a head translated to a 3D location H = [H_x, H_y, H_z] in camera C_j, when the head is moved along the camera's line of sight to the new position H' = α · H, the corresponding image is I' = (1/α) · I about the 2D projection of H in I.
For the proof, let the camera projection matrix and the perspective projection of a point X rotated and translated by R and T, respectively, be given as

\[
K =
\begin{bmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{bmatrix},
\qquad
proj(X) = K\,[RX + T]
\qquad (9)
\]
For simplicity, let us assume the translation of the head is done with respect to a central facial point, which we call the head center H. In practice, all the points of the head are rotated first and then translated. Hence, in the neutral pose the head center is at [0, 0, 0], i.e., the origin of the coordinate system. Any rotation applied to the head center at [0, 0, 0] keeps its 3D location unchanged at [0, 0, 0]. Hence,

\[
proj(X) = K\,[RX + H] = proj(H)
\qquad (10)
\]

when X is the head center in the neutral pose.
Proof 1: When H' = α · H, proj(H) = proj(H').

Solving the LHS,

\[
proj(H) = K[H] =
\begin{bmatrix}
f_x \frac{H_x}{H_z} + c_x \\[4pt]
f_y \frac{H_y}{H_z} + c_y
\end{bmatrix}
\qquad (11)
\]

Solving the RHS,

\[
proj(H') = K[\alpha H] =
\begin{bmatrix}
f_x \frac{\alpha H_x}{\alpha H_z} + c_x \\[4pt]
f_y \frac{\alpha H_y}{\alpha H_z} + c_y
\end{bmatrix}
\qquad (12)
\]

which reduces to proj(H) since α cancels.
Proof 2: When H' = α · H, I' = (1/α) · I.

Let the facial landmarks be denoted by P = [P_1, P_2, ..., P_{N_K}], where P_i = [P_x, P_y, P_z] (for simplicity we work with one sample fiducial point, denoted by P). Upon scaling, H' = α H and P' = P + (α − 1) H.

Knowing that an image is a set of 2D points, we consider the distance between any two known 2D points to compute the equivalent transformation to be applied to the image. The proof boils down to

\[
\left\lVert proj(H') - proj(P') \right\rVert_2
= \frac{1}{\alpha} \cdot \left\lVert proj(H) - proj(P) \right\rVert_2
\qquad (13)
\]
Solving the RHS,

\[
= \frac{1}{\alpha} \cdot
\sqrt{
\left( f_x \frac{H_x}{H_z} + c_x - f_x \frac{P_x}{P_z} - c_x \right)^2
+
\left( f_y \frac{H_y}{H_z} + c_y - f_y \frac{P_y}{P_z} - c_y \right)^2
}
\]
\[
= \frac{1}{\alpha} \cdot
\sqrt{
f_x^2 \left( \frac{H_x}{H_z} - \frac{P_x}{P_z} \right)^2
+
f_y^2 \left( \frac{H_y}{H_z} - \frac{P_y}{P_z} \right)^2
}
\qquad (14)
\]
Solving the LHS,

\[
proj(P') = proj(P + H' - H) =
\begin{bmatrix}
\dfrac{f_x\left(P_x + (\alpha-1)H_x\right) + c_x\left(P_z + (\alpha-1)H_z\right)}{P_z + (\alpha-1)H_z} \\[10pt]
\dfrac{f_y\left(P_y + (\alpha-1)H_y\right) + c_y\left(P_z + (\alpha-1)H_z\right)}{P_z + (\alpha-1)H_z}
\end{bmatrix}
\qquad (15)
\]

To simplify the above equation, we make the realistic assumption that H_z ≈ P_z, meaning the facial landmarks are approximately at the same depth as the head center H. With this assumption, the LHS becomes
\[
= \sqrt{
\left( f_x \frac{H_x}{H_z} + c_x - \frac{f_x\left(P_x + (\alpha-1)H_x\right)}{\alpha H_z} - c_x \right)^2
+
\left( f_y \frac{H_y}{H_z} + c_y - \frac{f_y\left(P_y + (\alpha-1)H_y\right)}{\alpha H_z} - c_y \right)^2
}
\]
\[
= \sqrt{
\left( f_x \frac{H_x}{H_z} - \frac{f_x\left(P_x + (\alpha-1)H_x\right)}{\alpha H_z} \right)^2
+
\left( f_y \frac{H_y}{H_z} - \frac{f_y\left(P_y + (\alpha-1)H_y\right)}{\alpha H_z} \right)^2
}
\]
\[
= \frac{1}{\alpha} \cdot
\sqrt{
f_x^2 \left( \frac{H_x}{H_z} - \frac{P_x}{P_z} \right)^2
+
f_y^2 \left( \frac{H_y}{H_z} - \frac{P_y}{P_z} \right)^2
}
\qquad (16)
\]

which matches the RHS in eq. 14.
Algorithm 2: Spherical Placement of Cameras.

Input: yaw_step, pitch_step, R, centroid
Output: Camera search space C

  centroid_X, centroid_Y, centroid_Z = centroid
  C = {}
  yaw = 0°
  while yaw ≤ 360° do
      pitch = 0°
      while pitch ≤ 360° do
          yaw_{C_j}, pitch_{C_j} = yaw, pitch
          X_{C_j} = centroid_X + R · cos(pitch) · sin(yaw)
          Y_{C_j} = centroid_Y − R · sin(pitch)
          Z_{C_j} = centroid_Z − R · cos(pitch) · cos(yaw)
          C_j = (X_{C_j}, Y_{C_j}, Z_{C_j}, yaw_{C_j}, pitch_{C_j})
          Add C_j to C
          pitch ← pitch + pitch_step
      end
      yaw ← yaw + yaw_step
  end
  return C