Optical-inertial Synchronization of MoCap Suit with Single Camera Setup for Reliable Position Tracking

Adam Riečický 2,4, Martin Madaras 1,2,4, Michal Piovarči 3,4 and Roman Ďurikovič 2

1 Institute of Software Technology and Interactive Systems, University of Technology, Vienna, Austria
2 Department of Applied Informatics, Comenius University, Bratislava, Slovakia
3 Saarland University, MMCI Saarbrücken, Germany
4 Skeletex Research, Bratislava, Slovakia
Keywords:
Motion Capture, Inertial Suit, Optical System, Optical-inertial Synchronization, Silhouette Tracking.
Abstract:
We propose a method for the synchronization of an inertial motion capture suit and a single-camera optical setup.
The proposed synchronization is based on an iterative optimization of an energy potential in image space, mini-
mizing the error between the camera image and a rendered virtual representation of the scene. For each frame,
an input skeleton pose from the mocap suit is used to render a silhouette of the subject. Then, the local
neighborhood around the last known position is searched by matching the silhouette to the distance transform
representation of the camera image using Chamfer matching. Using the combination of the camera tracking
and the inertial motion capture suit, it is possible to retrieve the position of joints that are hidden from the
camera view. Moreover, it is possible to estimate the position even when it cannot be captured by the suit sensors.
Our system can be used for both real-time tracking and off-line post-processing of already captured mocap
data.
1 INTRODUCTION
Considering the problem of person tracking and
movement analysis, there are many motion capture
(mocap) solutions available in both academia and in-
dustry. Mocap is a widely used technique for storing
and processing movement data of a person. Reliable
motion capture and tracking are necessary in the
film and games industries, virtual reality, biometrics, and
even healthcare. Optical tracking becomes problematic
when the tracking area contains occlusions. More-
over, the tracking area might be very non-convex, and
therefore non-coverable by optical-based mocap and
tracking systems. In such a case, it is suitable to
use non-optical methods for tracking; however, some
other limitations appear, such as drifting, calibration
and synchronization problems or additional noise in
the captured data. Sometimes, these limitations are
solved by a post-processing of raw data using com-
plex probabilistic models that have to be trained on
reliable training datasets, which might be impossible
to obtain in a given situation.
Nowadays, a variety of mocap systems are
suitable for recording body movements. There
are two major groups of mocap systems, optical-
based and inertial-based systems. Each group has its
own advantages and limitations. The advantages of
inertial systems are the flexible capture area (outdoor
capture, water capture), occlusion independence, fast
setup time, transferability and the possibility of direct
use of the raw output data for a 3D model. The biggest
disadvantage is that one can get only rotation data of
each skeleton joint. The joint positions in 3D space
have to be calculated based on the calibration process
and the root position estimation, which have to be ap-
proximated by a walking algorithm implemented in
the mocap software. On the other hand, the optical
systems are limited to indoor use only. They have
problems with occlusions and cannot directly return
the information about the joint rotation around the
bone axis. The individual joint position can be tracked
easily; however, the rotations need to be calculated in
the next evaluation stage. In order to solve the posi-
tioning disadvantage of inertial systems, solutions us-
ing radio/NFER or ultrasound positioning have been
proposed. However, these systems are in general hard
to calibrate and synchronize.
Furthermore, in our project, we need to obtain
reliable position tracking of the inertial suit using a
commodity RGB camera. The whole system should
be affordable and it should be compatible with both
outdoor and indoor usage.
Therefore, in our project we propose a hybrid
optical-inertial system. In this system, the mocap in-
ertial suit is combined and synchronized with a single
camera. Once the camera is calibrated and the rela-
tive position and orientation are calculated, it can be
used for real-time effortless position estimation. The
hybrid system does not require a training phase and
has advantages over both optical and inertial motion
capture systems. Other hybrid tracking systems either
need a complicated setup or are much more expen-
sive. Our system requires only an inertial suit and a
single RGB camera.
2 RELATED WORK
Inertial Suits. There are several inertial IMU suits
available on the market: a 3DSuit by Inertial Labs, an
IGS Cobra suit by Synertial, an MVN suit by XSens
and a Perception Neuron suit by Noitom. The suits
differ in sensor configuration, price and precision. For
example, the suits from XSens and Synertial have a
higher number of sensors and can stream raw data for
all the sensors. The Perception Neuron suit is a cheap
and affordable solution for the general public, with a
smaller set of sensors. In this case, the streamed data
available to the reader are interpolated from the raw
sensor data.
Optical Mocap. Optical systems can be divided
into two main groups: systems based on passive retro-
reflective or active markers, and marker-less mocap
systems that are trained on a set of training images.
The marker-based systems are able to perform
with much higher accuracy. In practice, optical sys-
tems and suits with markers are used, e.g. OptiTrack,
Vicon. A group of multi-view RGB-only-based mo-
cap systems working without a training stage exists,
using shape from silhouette or sums of spatial Gaus-
sians (Cheung et al., 2003; Stoll et al., 2011). These
optical systems usually need a complicated setup and
multiple cameras.
The trained probabilistic marker-less techniques
(Wojek et al., 2009; Andriluka et al., 2010) that work
on RGB images are not very precise in general. Thus,
they are typically used for academic research testing
only, or they have to be fused with inertial sensors as
in (Pons-Moll et al., 2010). Lately, a helmet with two
fisheye RGB cameras was proposed in (Rhodin et al.,
2017) for the motion capture of a subject wearing the
helmet. The system can only capture the motion of the
skeleton; it cannot be used for position tracking.
Probabilistic optical approaches can be trained di-
rectly on depth values obtained by an RGBD camera,
for example Kinect. The Kinect is mostly used for
real-time pose estimation (Shotton et al., 2013). This
probabilistic skeleton estimation is not very precise,
but it is well suited for casual real-time applications,
where Kinect tracking is mostly used. Moreover, the
Kinect can be used for both real-time skeleton estima-
tion and surface reconstruction using Kinect Fusion
(Izadi et al., 2011). Depth values from the RGBD
camera can be used for point cloud reconstruction
and the skeleton can be extracted from a point cloud.
However, this process is too slow for real-time mo-
tion capture. Nevertheless, it can be used for body
size estimation and calibration from a single scan.
These data can be used to improve motion capture
data (Anguelov et al., 2005).
Moreover, the probabilistic optical-based systems
are trained on RGB or RGBD images and estimate
position in 3D space based on probabilistic models
(Shotton et al., 2013; Andriluka et al., 2010). An op-
tical flow based on Chamfer matching can be used to
track the subject without a training stage (Dimitrijevic
et al., 2006; Katz and Aghajan, 2008). These methods
can be used directly on the input frames; however, a
background subtraction is a necessary preprocessing
step to obtain robust tracking results.
An extensive comparison of inertial and optical-
based motion capture suits can be found in (Skogstad
et al., 2011).
Person Tracking. Similarly to the case of mocap
solutions, tracking can be optical-based or approx-
imated using triangulation of distances to a signal
source, e.g. GPS. The Lighthouse tracking by Valve
is part of the HTC Vive and is based on measuring
the time delay between emitting a flash and sweeping
a beam of light across the room. The receivers cal-
culate the time delay between these two light events,
and using simple trigonometry the 3D position of
the receiver can be evaluated in real-time.
Hybrid Systems. Several hybrid approaches were
published in recent years. The hybrid systems for
skeleton and position tracking are based on a fusion
of the IMU orientation data and some other sensor. In
(Ziegler et al., 2011), the subject in an inertial suit is
tracked by a robot with a laser scanner. Such a combi-
nation can track the subject’s position and trajectory
in large areas; however, it might be impossible to use
the robot in small interiors, and the robot is too expen-
sive a tool for common usage. A fusion of multi-view
RGB cameras with a few IMUs was proposed in (Pons-
Moll et al., 2010; von Marcard et al., 2016). These
approaches for fusion give very good results; how-
ever, the fusion needs a scanned template model of
the subject and the system needs multiple RGB cam-
eras in order to correctly fit the template into the sil-
houette. A combination of discriminative and gener-
ative tracking using a depth sensor was used in (Hel-
ten et al., 2013). The approach also needs a template
mesh model and the RGBD camera has a very lim-
ited volume where the fusion works precisely enough.
In general, the mentioned related hybrid approaches
either need a much more complicated and expensive
setup (multiple cameras, depth camera, robot), or they
have a much more complex tracking pipeline than our
approach (template mesh scanning, non-linear energy
minimization, training stage).
3 OPTICAL-INERTIAL
SYNCHRONIZATION
The main idea behind the optical-inertial tracking so-
lution of the suit and the camera is to determine the 3D
position of the actor from his silhouette in the camera
image, based on his current pose. Knowing the actor's
skeleton pose from the suit in real-time, we are able to
predict the body shape we are looking for within the
camera image. First, a base mesh is constructed using
the actor’s specific parameters such as height or lo-
cal diameters. This mesh is then used for rendering a
shape which is similar to the actor’s silhouette in the
image. The virtual camera used for rendering the base
mesh needs to see the scene in the same way
as the real camera sees the scene with the actor; therefore,
it needs to be calibrated.
The rendered base mesh silhouette is then used to
search the local neighborhood of the last known po-
sition of the subject in the next image frame. By min-
imizing an energy composed of a spatial integration
of the Chamfer matching error in the image space, we
are able to perform real-time tracking of the subject.
During the tracking, a 3D virtual scene is rendered
and matched to the camera image; therefore, if it is
matched with the precisely calibrated camera setup,
we are able to directly estimate the 3D position of the
subject in the real world.
3.1 System Overview
The whole tracking system is composed of three
phases: a calibration phase, a tracking start-up phase
and an iterative tracking phase. The first two phases
are used only for the initial setup, to determine and to
correctly represent the real world in the tracking sys-
tem; the third phase is the actual tracking
stage.
The calibration phase needs to be done only once,
or when the camera is replaced. This step is required
to acquire correct camera parameters. The parameters
can be saved and reused before each tracking session.
The second stage, the tracking start-up phase,
needs to be performed at least once before each ses-
sion, to synchronize the real-world camera with the
virtual camera of the system, and to specify the actor's
starting location for the tracking. However, it is also
possible to assign these properties during the itera-
tive tracking phase, seamlessly and without interrupting
the tracking procedure.
Finally, the tracking phase is iteratively performed
during the whole remaining tracking time. The output of
this stage is the true 3D position of the actor in both
the virtual scene and the real world.
3.2 Initial Setup and Calibration
For the camera calibration, OpenCV with its built-
in modules is used. We used the ChArUco module
that uses a hybrid checkerboard for both camera cali-
bration and camera position estimation, as can be seen
in Figure 1. Given several pairs of point correspon-
dences of calibration patterns in the real world and
in the image space, it is possible to find intrinsic and
extrinsic parameters of the camera.
Figure 1: The ChArUco calibration board. The ChArUco board is
a combination of the classical calibration checkerboard and
ArUco markers.
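As an illustration, the following minimal sketch shows how such a ChArUco-based calibration can be done with OpenCV's aruco module (opencv-contrib). The board layout, the image file names and the pre-4.7 aruco API used here are assumptions of the sketch, not details taken from our setup; the exact function names differ slightly between OpenCV versions.

```python
import cv2
import cv2.aruco as aruco

# Hypothetical board layout; the parameters must match the printed board.
dictionary = aruco.getPredefinedDictionary(aruco.DICT_6X6_250)
board = aruco.CharucoBoard_create(5, 7, 0.04, 0.03, dictionary)

all_corners, all_ids, image_size = [], [], None
for path in ["calib_00.png", "calib_01.png", "calib_02.png"]:  # calibration shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]
    # Detect the ArUco markers, then interpolate the chessboard corners between them.
    corners, ids, _ = aruco.detectMarkers(gray, dictionary)
    if ids is None:
        continue
    n, ch_corners, ch_ids = aruco.interpolateCornersCharuco(corners, ids, gray, board)
    if n > 3:
        all_corners.append(ch_corners)
        all_ids.append(ch_ids)

# Intrinsic and extrinsic parameters from the point correspondences.
err, camera_matrix, dist_coeffs, rvecs, tvecs = aruco.calibrateCameraCharuco(
    all_corners, all_ids, board, image_size, None, None)
```

The resulting camera_matrix and dist_coeffs can be stored and reused before each tracking session, as described above.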
Next, the body size of the actor needs to be measured
manually using a ruler, or automatically using
a Kinect or a calibrated RGB camera. The measured
body height and body radii are then used together with
the mocap position to construct a base mesh approxi-
mating the body of the subject.
3.3 Virtual Scene Creation
The main idea behind the synchronization is to create
a virtual scene according to the parameters acquired
in the real world. This step is called the tracking start-
up phase, and it generally must be executed before
each tracking session and whenever some scene properties
change, e.g. the camera is moved, the camera is replaced,
or the starting position is changed.
Figure 2: A model-view matrix that consists of a rotation
and a translation of the ChArUco board into the camera im-
age is obtained in the tracking start-up phase. Thus, the
origin of the virtual scene is set into the ChArUco board
location and the camera position and orientation are set ac-
cordingly.
Figure 3: A virtual scene is constructed based on the camera
calibration, calculated scene parameters and current mocap
data. Using either the real-time mocap data or stored off-
line mocap data, a silhouette of the tracked subject is ap-
proximated and later used during the tracking phase.
In order to get the origin of the virtual space, a
ChArUco board marker (see Figure 2) is placed in
front of the camera. The pose estimation takes the correspond-
ing points together with the board parameters as input,
and produces the rotation and translation trans-
formations that give us a model-view matrix of the
checkerboard in world-space coordinates. Having the
ChArUco marker detected, we are able to estimate the
camera position and orientation relative to the origin.
The second purpose of placing a ChArUco board into
the scene is to define the starting position of the ac-
tor (the actor starts in the origin of the virtual scene).
This is the position where the tracking starts. The
precision of the tracked position relies heavily on the
camera calibration and a proper virtual scene setup (see
Figure 3). If the ChArUco board cannot be used for
some reason, the starting position and camera param-
eters can always be set manually.
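Continuing the calibration sketch above, the tracking start-up step could be assembled roughly as follows. The helper name is hypothetical, and estimatePoseCharucoBoard with the 4x4 matrix layout follows the legacy OpenCV aruco API; board, dictionary, camera_matrix and dist_coeffs are assumed to come from the calibration sketch.

```python
import numpy as np
import cv2
import cv2.aruco as aruco

def estimate_board_pose(frame_gray, board, dictionary, camera_matrix, dist_coeffs):
    """Detect the ChArUco board and return a 4x4 model-view matrix, or None."""
    corners, ids, _ = aruco.detectMarkers(frame_gray, dictionary)
    if ids is None:
        return None
    n, ch_corners, ch_ids = aruco.interpolateCornersCharuco(corners, ids, frame_gray, board)
    if n < 4:
        return None
    ok, rvec, tvec = aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, camera_matrix, dist_coeffs, None, None)
    if not ok:
        return None
    rot, _ = cv2.Rodrigues(rvec)      # axis-angle -> 3x3 rotation matrix
    mv = np.eye(4)
    mv[:3, :3] = rot                  # rotation of the board in camera space
    mv[:3, 3] = tvec.ravel()          # translation of the board origin
    return mv

# The inverse of the model-view matrix gives the camera pose expressed in the
# board (virtual scene) coordinate system; the board also marks the actor's
# starting position:
# camera_in_scene = np.linalg.inv(estimate_board_pose(...))
```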
3.4 Silhouette Image Database
Construction
The camera image contains objects and subjects
which are not important for the system. The goal of
the system is to locate and track only the actor dressed
in the motion capture suit. Therefore, the pose data
from the suit are used to determine which silhouette to
search for. The reader of the suit rotations is able to
stream local transformations for each frame in real-time.
Figure 4: A base mesh is created using the skeleton acquired
from the mocap suit. The skeleton is enhanced with mea-
sured radii of the actor’s body. Note that the base mesh
construction process is depicted in 2D only.
Firstly, a shape that roughly represents the actor’s
body is needed. Here, it might be possible to use a
broad set of shapes, from primitives roughly approxi-
mating the body to a highly detailed 3D scan. However,
we chose to create a simple base mesh approximat-
ing the body shape from the input skeleton, be-
cause it is easily customizable, scalable and can be
generated in real-time for any skeleton pose. For this,
the SQM algorithm (Bærentzen et al., 2012) is used,
which is able to generate such a mesh given only
the skeleton and the radius of a sphere around each
skeleton node (see Figure 4). These radii, as well as
the skeleton height, are dependent on the actor's body type
and need to be measured or approximated manually.
Such a specific base mesh is generated only once, and
a pose for every frame is created by applying rotations
from suit sensors and transforming the base mesh ac-
cordingly using a skinning algorithm.
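The paper does not prescribe a particular skinning method; a minimal sketch of generic linear blend skinning, which could be used to pose the SQM base mesh from the per-joint transforms streamed by the suit, is shown below. The weight and bone-matrix layout are assumptions of this sketch.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, weights, bone_matrices):
    """Generic linear blend skinning sketch.

    rest_vertices : (V, 3) base mesh vertices in the rest pose
    weights       : (V, B) per-vertex bone weights, rows summing to 1
    bone_matrices : (B, 4, 4) transforms mapping rest-pose bone space to the
                    current pose (built from the suit's joint rotations)
    """
    V = rest_vertices.shape[0]
    homogeneous = np.hstack([rest_vertices, np.ones((V, 1))])      # (V, 4)
    # Blend the bone transforms per vertex, then apply the blended matrix.
    blended = np.einsum('vb,bij->vij', weights, bone_matrices)      # (V, 4, 4)
    posed = np.einsum('vij,vj->vi', blended, homogeneous)           # (V, 4)
    return posed[:, :3]
```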
With the virtual camera and scene created and suc-
cessfully calibrated, it is possible to render an image
of the base mesh as if it was seen by a real camera.
Afterwards, this image is processed to obtain only
the silhouette of a rendered image. Finally, a set of
silhouette images is rendered, applying several shifts
of a base mesh in eight evenly distributed directional
vectors. Our set consists of images shifted k times by
d in a space around a specified pose as well as one im-
age of a base mesh exactly in the current position, as
can be seen in Figure 5. In our experiments, we used the
constants d = 10 cm and k = 3; therefore, in total, a set
of 25 images is rendered and stored in the silhouette
database.

Figure 5: The pipeline of our optical-inertial synchronization. (Top) the camera image is thresholded and transformed into a
distance transform image. (Bottom) a base mesh is constructed based on the acquired skeleton and it is rendered in different
positions. The base of 8 search vectors is used to render 8 shifted silhouettes. The Chamfer matching of the rendered base
mesh and the distance transform image is performed, and the error is evaluated. Finally, the error minimization is used to find
the next position in 3D space.
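A small sketch of how the 25 candidate offsets (8 directions times k shifts, plus the unshifted pose) could be generated is shown below. Restricting the shifts to the ground plane and the chosen axis convention are assumptions of this sketch.

```python
import numpy as np

def silhouette_offsets(d=0.10, k=3):
    """Build the 3D offsets used to render the silhouette database.

    Eight evenly distributed horizontal directions, each shifted 1..k times
    by step d (in meters), plus the unshifted pose: 8 * k + 1 = 25 offsets.
    """
    offsets = [np.zeros(3)]                      # base mesh at the current position
    for a in range(8):
        angle = a * np.pi / 4.0                  # 0°, 45°, ..., 315°
        direction = np.array([np.cos(angle), 0.0, np.sin(angle)])
        for step in range(1, k + 1):
            offsets.append(step * d * direction)
    return offsets                               # 25 candidate positions

# Each offset is applied to the posed base mesh, the mesh is rendered with the
# calibrated virtual camera, and the binary silhouette image S_k is stored.
```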
3.5 Tracking
The tracking phase begins after the tracking start-up
phase has been executed successfully, which properly sets
up the camera for rendering the base mesh silhouette
database. The actor is located at a specified position
defined by the ChArUco board placed in the scene.
At this point, the tracking phase is ready to start. This
start-up position is considered to be the actor's true
position in the first frame. For each subsequent frame, a mo-
tion vector is estimated to update the actor's
position.
Figure 6: Tracking of the actor position using Chamfer
matching. (Top) an input image and the subtracted back-
ground image in grayscale. (Bottom) applied adaptive
thresholding and calculated distance transform of the sil-
houette.
First, the captured camera image is pre-processed us-
ing background subtraction and thresholding, and Canny
edge detection is performed so that only the actor's
silhouette is obtained. Then, the image is transformed
into a distance transform image (see Figure 6). Af-
terwards, the pre-rendered database of base
mesh silhouette images is used to evaluate the energy
for the optimization. The error potential e_k is calcu-
lated for each silhouette k as an integration of the dis-
tance transform function DT(x) over the actor silhou-
ette S as
$$ e_k = \iint_{S} DT(x(s, t)) \, ds \, dt \qquad (1) $$
where s and t are the parameters of the
one-dimensional silhouette curve and of a kernel func-
tion that is applied across the curve to make the silhou-
ette wider, respectively. The term x(s, t) is a func-
tion that maps the parameters s and t to the image space,
where the function DT(x) is evaluated. The integration
in discrete form is performed in the image space as a
sum over the non-zero pixels of the silhouette image S_k and
is normalized afterwards as
$$ e_k = \frac{\sum_{i=0}^{m} \sum_{j=0}^{n} DT(i, j) \cdot S_k(i, j)}{\mathrm{count\_non\_zero}(S_k)} \qquad (2) $$
where (i, j) refers to a pixel position of the image with
dimensions m × n, DT is the distance transform image,
S_k is the binary silhouette mask image, and the
function count_non_zero() returns the number of non-zero
pixels contained in the image. By minimizing the
error energy, we are able to evaluate the direction and
the magnitude of the subject's movement based on the
shift vector used for the construction of the database
image. Adding this vector to the position of the actor
in the last frame, we obtain the actor's position in the
current frame. At the end of the tracking phase, we have
a raw corrected 3D position of the actor. The pipeline
of the system up to this point can be seen in Figure 5.
To enhance the raw data, a suitable post-processing
method might be used.

Figure 7: Comparison of XSens MVN tracking (top) and our optical-inertial evaluation (bottom). In the second frame, the
sliding starts and the mocap system fails to evaluate the position correctly. Images were captured using a Microsoft LifeCam
HD 3000 RGB webcam.
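To make the tracking step concrete, the following hedged sketch combines the image pre-processing with the discrete energy of Equation (2). The OpenCV parameters (Otsu thresholding, Canny thresholds) are illustrative stand-ins for the thresholding described above, and silhouette_db and offsets are assumed to come from the database construction step.

```python
import cv2
import numpy as np

def distance_transform_image(frame_gray, background_gray):
    """Pre-process a camera frame into a distance transform image (Figure 6)."""
    diff = cv2.absdiff(frame_gray, background_gray)          # background subtraction
    _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    edges = cv2.Canny(mask, 50, 150)                          # actor silhouette edges
    # Distance to the nearest edge pixel; edge pixels must be zero for distanceTransform.
    return cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)

def chamfer_energy(dt_image, silhouette_mask):
    """Equation (2): mean distance-transform value over the silhouette pixels."""
    nonzero = np.count_nonzero(silhouette_mask)
    if nonzero == 0:
        return np.inf
    return float(np.sum(dt_image * (silhouette_mask > 0))) / nonzero

def best_shift(dt_image, silhouette_db, offsets):
    """Pick the database silhouette (and its 3D offset) with the lowest energy."""
    energies = [chamfer_energy(dt_image, s) for s in silhouette_db]
    return offsets[int(np.argmin(energies))]
```

The returned offset is the motion vector that, added to the last known position, gives the estimate for the current frame.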
3.6 Post-processing via Gaussian
Filtering
The change of estimated position of the subject in
time might not be continuous. Thus, it is useful
to post-process the discontinuities into a continu-
ous movement. In our experiments we have tried
two methods of post-processing: a Gaussian-based
smoothing and Kalman filtering. We used local Gaus-
sian smoothing in the neighborhood of 20 time steps
and the Extended Kalman Filter (EKF) (Julier and
Uhlmann, 2004) implemented in OpenCV. Using the
EKF filtering, the resulting graphs seem visually
smoother, but the overall error was higher. Therefore,
in the final results, the Gaussian smoothing was used
(see Figure 9).
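A possible implementation of the local Gaussian smoothing over a 20-step neighborhood is sketched below using SciPy; the mapping from the window size to the Gaussian standard deviation is an assumption of this sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_positions(positions, window=20, sigma=None):
    """Local Gaussian smoothing of the raw 3D trajectory.

    positions : (N, 3) raw per-frame positions from the tracking phase
    window    : neighborhood size used in our experiments (20 time steps)
    sigma     : Gaussian standard deviation; deriving it from the window
                size is an assumption of this sketch.
    """
    if sigma is None:
        sigma = window / 6.0        # ~99% of the kernel mass inside the window
    return gaussian_filter1d(positions, sigma=sigma, axis=0, mode='nearest')
```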
4 RESULTS
We demonstrate the results of our approach on captur-
ing the motion and estimating the position of a sub-
ject in a space. The subject wears an XSens suit and
the scene is captured using a Microsoft LifeCam HD
3000 RGB webcam and iDS 3 uEye monochromatic
camera with a fisheye lens.
4.1 Correction of Mocap Suit Data
First, the position in 3D space is approximated using
the standard walking algorithm usually implemented
within the mocap software. In this scenario, the ac-
tor starts to run and finishes the running sequence
by sliding on the ground. The sliding is the stage
where the inertial mocap suit fails. We use the posi-
tion estimated by XSens MVN Studio and export it
into a BVH file. Second, the position is evaluated using
our optical-inertial system. Both estimated positions,
from the original method and from our camera-based cor-
rection, can be seen in Figure 7.

Figure 8: Comparison of the error difference in Euclidean distance between the methods and the ground truth obtained using
HTC Vive: (red) original position from the inertial suit software and (yellow) our optical-inertial method. The graph is
evaluated on a dataset where the movement was the most problematic for the original method (e.g. jumping, sliding).
4.2 Evaluation of Estimated Position
In order to evaluate our method by comparing re-
sults to the ground truth, we evaluate our correction
of position inside a known environment for move-
ment in predefined patterns (see Figure 11). Dur-
ing this evaluation, the subject moves along the de-
fined trajectories with known dimensions. During
the evaluation we track three position estimations in
time. The subject is tracked by HTC Vive lighthouses
(ground truth), and position estimation is done by
XSens MVN Studio and by our optical-inertial method.
A comparison of the error difference in Euclidean
distance between the original position from XSens MVN
Studio and our approach can be seen in Figure 8. Fur-
thermore, we tried to smooth the raw results from our
approach; the graphs with the smoothed positions of
two different datasets are shown in Figure 9. In Fig-
ure 10, the setup used for the evaluation is depicted
and described.

Figure 9: Comparison of error difference with filtered data: (red) position calculated by the original inertial suit software and
(blue) post-processed versions of the estimated positions using our approach. Another example from the datasets is plotted on
the right. In both cases, the smoothed versions of the evaluated positions approximate the movement and the real position of
the subject much better.
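For reference, the per-frame error plotted in Figures 8 and 9 can be computed as a simple Euclidean distance; the sketch below assumes both trajectories are already aligned to a common time base and coordinate frame.

```python
import numpy as np

def euclidean_error(estimated, ground_truth):
    """Per-frame Euclidean distance between an estimated trajectory and the
    ground-truth trajectory, both given as (N, 3) arrays of 3D positions."""
    return np.linalg.norm(estimated - ground_truth, axis=1)
```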
Figure 10: The evaluation setup. The subject wearing a
mocap suit (C) is captured by a camera (A) and tracked by
HTC Vive lighthouses (B). The position estimated by the
lighthouses is used as a ground truth when the original mo-
cap suit tracking is compared to the proposed method.
5 LIMITATIONS
The main limitation of the proposed solution is the de-
pendency on the static background subtraction; thus
we are not able to guarantee robust tracking in scenes
with a dynamically changing background. In the case
of background changes, there are edges in the image
space not related to the actor that may drive the track-
ing into a local minimum. Another limitation of the
system is the predefined set of search directions, which
produces discretization errors. If required, the search
space could be sampled more densely at the cost of
higher computation time. The rendered base mesh
is only a rough approximation of the human body; a
highly detailed full-body scan could be used for bet-
ter approximation of the silhouette. However, the base
mesh is easy to compute, affordable to acquire and the
results are good enough for our applications.
6 CONCLUSION AND FUTURE
WORK
A system for optical-inertial synchronization of the
mocap suit and the camera was implemented and de-
scribed in this paper. In general, the system can find
its utilization in applications such as virtual reality,
movement analysis, sports evaluation, and biometrics.
Using a hybrid mocap system, drift issues of inertial
suits can be solved. Moreover, the lack of positioning
capability of inertial mocap was addressed, and therefore
the system can be directly used for subject movement analy-
sis in 3D space, ergonomic work analysis, or
virtual reality games. The inertial-optical hybrid sys-
tem is capable of measuring a subject’s position with
high precision even if partially or fully occluded, and
all the computations can be performed in real-time.
These results show promising improvement for iner-
tial suit position tracking, but more extensive evalua-
tion is required in the future.
As future work we would like to use the system for
an automatic and effortless recalibration of the suit.
The correct position and orientation of the joints, eval-
uated from the camera image, can be used for on-line
correction of suit sensors.
Figure 11: Evaluation of our method on different trajectories within a grid. The grid has a size of 6 × 6 meters. The patterns
were chosen to fully cover the tracking area in different directions. Moreover, different types of movement were used during
the evaluation, such as walking, jumping, sliding, etc.

ACKNOWLEDGMENTS

We would like to thank the SAIA Action Austria-Slovakia for financial support of the project and the
Interactive Media Systems group at TU Wien for all the necessary hardware and laboratory space during
the project. Our project was also partly supported by Comenius University grant No. UK/293/2017.
REFERENCES
Andriluka, M., Roth, S., and Schiele, B. (2010). Monocular
3d pose estimation and tracking by detection. In 2010
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition, pages 623–630.
Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers,
J., and Davis, J. (2005). Scape: Shape completion and
animation of people. ACM Trans. Graph., 24(3):408–
416.
Bærentzen, J., Misztal, M., and Wełnicka, K. (2012). Con-
verting skeletal structures to quad dominant meshes.
Computers & Graphics, 36(5):555–561. Shape Mod-
eling International (SMI) Conference 2012.
Cheung, G. K. M., Baker, S., and Kanade, T. (2003). Shape-
from-silhouette of articulated objects and its use for
human body kinematics estimation and motion cap-
ture. In Proceedings of the 2003 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition, CVPR’03, pages 77–84, Washington,
DC, USA. IEEE Computer Society.
Dimitrijevic, M., Lepetit, V., and Fua, P. (2006). Human
body pose detection using Bayesian spatio-temporal
templates. Comput. Vis. Image Underst., 104(2):127–
139.
Helten, T., Müller, M., Seidel, H.-P., and Theobalt, C.
(2013). Real-time body tracking with one depth cam-
era and inertial sensors. In The IEEE International
Conference on Computer Vision (ICCV).
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe,
R., Kohli, P., Shotton, J., Hodges, S., Freeman, D.,
Davison, A., and Fitzgibbon, A. (2011). Kinectfu-
sion: Real-time 3d reconstruction and interaction us-
ing a moving depth camera. In Proceedings of the
24th Annual ACM Symposium on User Interface Soft-
ware and Technology, UIST ’11, pages 559–568, New
York, NY, USA. ACM.
Julier, S. J. and Uhlmann, J. K. (2004). Unscented filtering
and nonlinear estimation. Proceedings of the IEEE,
92(3):401–422.
Katz, I. and Aghajan, H. (2008). Multiple camera-based
chamfer matching for pedestrian detection. In 2008
Second ACM/IEEE International Conference on Dis-
tributed Smart Cameras, pages 1–5.
Pons-Moll, G., Baak, A., Helten, T., Müller, M., Seidel, H.-
P., and Rosenhahn, B. (2010). Multisensor-fusion for
3d full-body human motion capture. In IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Rhodin, H., Richardt, C., Casas, D., Insafutdinov, E.,
Shafiei, M., Seidel, H., Schiele, B., and Theobalt,
C. (2017). Egocap: Egocentric marker-less motion
capture with two fisheye cameras (extended abstract).
CoRR, abs/1701.00142.
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finoc-
chio, M., Blake, A., Cook, M., and Moore, R. (2013).
Real-time human pose recognition in parts from single
depth images. Commun. ACM, 56(1):116–124.
Skogstad, S. A., Nymoen, K., and Høvin, M. (2011). Com-
paring inertial and optical mocap technologies for
synthesis control. In 2011 IEEE Conference on Sys-
tems, Man and Cybernetics.
Stoll, C., Hasler, N., Gall, J., Seidel, H.-P., and Theobalt, C.
(2011). Fast articulated motion tracking using a sums
of gaussians body model. In Computer Vision (ICCV),
2011 IEEE International Conference on, pages 951–
958. IEEE.
von Marcard, T., Pons-Moll, G., and Rosenhahn, B. (2016).
Human pose estimation from video and imus. Trans-
actions on Pattern Analysis and Machine Intelligence,
38(8):1533–1547.
Wojek, C., Walk, S., and Schiele, B. (2009). Multi-cue on-
board pedestrian detection. In 2009 IEEE Conference
on Computer Vision and Pattern Recognition, pages
794–801.
Ziegler, J., Kretzschmar, H., Stachniss, C., Grisetti, G., and
Burgard, W. (2011). Accurate human motion capture
in large areas by combining imu- and laser-based peo-
ple tracking. In IROS, pages 86–91. IEEE.