A PASSIVE 3D SCANNER
Acquiring High-quality Textured 3D-models Using a Consumer Digital-camera
Matthias Elter, Andreas Ernst and Christian Küblbeck
Fraunhofer Institute for Integrated Circuits (IIS), Am Wolfsmantel 33, 91058 Erlangen, Germany
Keywords:
3D scanner, passive 3D reconstruction, shape from stereo, structure from motion, texture mapping, volumetric
fusion, dense stereo.
Abstract:
We present a low-cost, passive 3d scanning system using an off-the-shelf consumer digital camera for image
acquisition. We have developed a state of the art structure from motion algorithm for camera pose estimation
and a fast shape from stereo approach for shape reconstruction. We use a volumetric approach to fuse partial
shape reconstructions and a texture mapping technique for appearance recovery. We extend the state of the
art by applying modifications of standard computer vision techniques to images of very high resolution to
generate high quality textured 3d models. Our reconstruction results are robust and visually convincing.
1 INTRODUCTION
Acquiring geometric models of physical objects us-
ing active scanning techniques is a solved problem. A
great variety of mature 3d scanning techniques (based
on laser triangulation or structured light) is available
today. However, all of these techniques have the ma-
jor drawback that they acquire an object using active
sensors. Furthermore the required hardware is usu-
ally expensive. A less intrusive and cheaper approach
is to reconstruct both the shape and appearance (color
and texture) of an object from images acquired us-
ing a standard digital camera. A passive sensor like
a standard digital camera is cheap and does not
change the object that is to be acquired (for exam-
ple by illuminating it). In this paper we present a pas-
sive 3d scanning technique which reconstructs both
3d shape and appearance of arbitrary objects from 2d
images acquired by a consumer digital camera. To
achieve this we have developed a reconstruction ap-
proach based on standard computer vision concepts
like structure from motion and shape from stereo. We
extend the state of the art by applying modifications
of these techniques to images of very high resolution
to generate high quality models and textures.
2 STATE OF THE ART
Many approaches to passive 3D shape reconstruc-
tion can be found in the literature. The most important
basic techniques that are employed are shape from
stereo, shape from silhouette, shape from shading,
and shape from focus. Shape from stereo techniques
reconstruct 3d shape from point correspondences in
two or more images. A taxonomy and evaluation of
shape from stereo algorithms can be found in a re-
cent survey (Scharstein and Szeliski, 2002). Shape
from silhouette algorithms reconstruct the 3d shape of
an object from a sequence of 2d silhouette (contour)
images of the object. Laurentini (Laurentini, 1994)
and Kutulakos (Kutulakos and Seitz, 1998) provide
a theoretical foundation of the concepts exploited by
shape from silhouette algorithms. Implementations
include (Tarini et al., 2002) and (Fitzgibbon
et al., 1998). Shape from shading approaches try to es-
timate shape from the shading pattern (light and shad-
ows) in a single image of an object. Here shape
is estimated in the sense of a field of normals from
which a surface can be recovered up to scale. Zhang
and Tsai evaluate six well-known shape from shad-
ing algorithms in a recent survey paper (Zhang et al.,
1999). A technique that tries to estimate object shape
by changing the camera intrinsics is shape from fo-
cus. Image pixels corresponding to different depths
will, obviously, be in optimal focus for different set-
tings of the camera intrinsics like the focal length.
Examples of shape from focus approaches are (Ziou,
1998), (Schechner and Kiryati, 2000) and (Favaro and
Soatto, 2002).
Far fewer publications describing complete passive
3d scanning systems, including both shape and appearance
recovery, can be found in the literature. Examples are
(Niem and Wingbermühle, 1997), (Weik, 2000) and
(Fitzgibbon et al., 1998).
3 METHODS
3.1 Image Acquisition and Camera
Calibration
Images are acquired using a cheap ($300) Panasonic
DMC-FZ3 consumer digital camera. We make use of
its continuous drive mode, which allows us to continu-
ously take three frames per second at the full resolution
of 2015 × 1512 pixels. With the camera in continuous
drive mode mounted on a tripod, an image sequence,
showing the object that is to be scanned from multi-
ple viewpoints, is acquired by rotating the object in front of
the camera. The intrinsic camera parameters are ob-
tained by our robust extension (Rupp et al., 2006) of
Zhang's classic camera calibration technique (Zhang,
1998). We furthermore use the lens distortion coef-
ficients obtained by the camera calibration to remove
lens distortion effects from the images.
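As an illustration of this step, the following Python sketch performs a plain Zhang-style calibration with OpenCV's built-in routines; it is a simplified stand-in for our robust extension (Rupp et al., 2006), and the checkerboard size as well as the glob patterns for the image paths are hypothetical.

```python
import glob
import cv2
import numpy as np

def calibrate(calib_paths, pattern=(9, 6)):
    """Zhang-style calibration from checkerboard images: returns the
    intrinsic matrix K and the lens distortion coefficients."""
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)
    obj_pts, img_pts = [], []
    for path in calib_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    rms, K, dist, _, _ = cv2.calibrateCamera(
        obj_pts, img_pts, gray.shape[::-1], None, None)
    return K, dist

# Undistort the object sequence with the estimated coefficients.
K, dist = calibrate(sorted(glob.glob("calib/*.jpg")))  # hypothetical paths
frames = [cv2.undistort(cv2.imread(p), K, dist)
          for p in sorted(glob.glob("sequence/*.jpg"))]
```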
3.2 Structure from Motion
For camera pose estimation we have implemented a
state of the art structure from motion algorithm. Fea-
ture points are detected and tracked from view to view
using the approaches introduced by Kanade, Shi and
Tomasi (Tomasi and Kanade, 1991; Shi and Tomasi,
1994). Due to occlusion and because of features leav-
ing the field of view of the camera, points usually
cannot be tracked throughout all views of the image
sequence. Therefore, lost features are constantly re-
placed by newly detected points while processing the
image sequence. We have implemented a sequential
structure from motion algorithm: first an initial pair
of views and initial structure is reconstructed. Then
additional views and additional structure are sequen-
tially added to the initial reconstruction. Figures 1, 2
and 3 illustrate the sequential approach of our algo-
rithm.
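A minimal sketch of this detect-and-track loop, using OpenCV's implementations of the Shi-Tomasi detector and the pyramidal Lucas-Kanade tracker, is given below; the thresholds are hypothetical, and the bookkeeping that preserves track identities across views is omitted.

```python
import cv2
import numpy as np

def track(frames, max_pts=2000, min_pts=1500):
    """Shi-Tomasi detection + pyramidal Lucas-Kanade tracking; lost
    features are replenished. Track-identity bookkeeping is omitted."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, max_pts, 0.01, 10)
    per_view = [pts]
    for frame in frames[1:]:
        cur = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Track the existing features into the new view.
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
        pts = nxt[status.ravel() == 1].reshape(-1, 1, 2)  # drop lost tracks
        # Replace features lost to occlusion or the image border.
        if len(pts) < min_pts:
            fresh = cv2.goodFeaturesToTrack(cur, max_pts - len(pts), 0.01, 10)
            if fresh is not None:
                pts = np.vstack([pts, fresh])
        per_view.append(pts)
        prev = cur
    return per_view
```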
Figure 1: Initial views (gray) and structure (blue) of a hu-
man head scene. View positions and orientations are illus-
trated using pyramid glyphs.
3.2.1 Initial Structure and Motion
The fundamental matrix $F$ for the initial pair of views
$v_1$ and $v_2$ is estimated using a robust estimator. We
have developed a genetic algorithm approach similar
to the one described by Rodehorst (Rodehorst, 2004)
instead of using the classic RANSAC algorithm. Since
we use a calibrated camera, we can obtain the
essential matrix $E$ directly from $F$. We then obtain the
metric camera projection matrices $P_1$ and $P_2$ corre-
sponding to the two initial views from $E$ by means
of factorization (Hartley and Zisserman, 2003). Using
$P_1$ and $P_2$, initial structure is then obtained from the
corresponding feature points of the initial view pair
by means of triangulation.
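In terms of off-the-shelf OpenCV building blocks, the bootstrap looks roughly as follows; RANSAC stands in for our genetic-algorithm estimator, and the inputs x1, x2 and K are assumed to be given.

```python
import cv2
import numpy as np

def initial_reconstruction(x1, x2, K):
    """Two-view bootstrap: x1, x2 are (N, 2) matched feature points of the
    initial view pair, K the calibrated intrinsic matrix."""
    # Robust F (RANSAC here, standing in for the genetic algorithm).
    F, inliers = cv2.findFundamentalMat(x1, x2, cv2.FM_RANSAC, 1.0, 0.999)
    E = K.T @ F @ K                             # calibrated camera: E from F
    _, R, t, _ = cv2.recoverPose(E, x1, x2, K)  # factor E into [R|t]
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, x1.T, x2.T)  # homogeneous, 4xN
    return P1, P2, (X[:3] / X[3]).T                # metric structure, Nx3
```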
3.2.2 Adding Views
Pose and structure of the remaining views of the im-
age sequence are now sequentially added to the re-
construction. Based on 3D/2D point correspondences
between the already reconstructed structure and fea-
ture points in a new view $v_n$, its camera projection
matrix $P_n$ can be estimated. Again, we use a
genetic algorithm as a robust estimator. Already re-
constructed structure is then refined and new points
are added by means of triangulation. For robust tri-
angulation of feature points that are visible in more
than the minimal two views, we use a robust least-
median-of-squares based estimator. Once all views of
the image sequence are added to the reconstruction, it
is refined by a global optimization step using bundle
adjustment (Brown, 1976).
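A sketch of the pose step for a single new view, again with RANSAC PnP standing in for the genetic-algorithm estimator; X3d is assumed to be the array of already-reconstructed points visible in the view and x2d the corresponding feature positions.

```python
import cv2
import numpy as np

def add_view(X3d, x2d, K):
    """Pose of a new view from (N, 3) reconstructed points X3d and their
    (N, 2) feature positions x2d. Lens distortion is already removed,
    hence distCoeffs=None."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(X3d, x2d, K, None)
    R, _ = cv2.Rodrigues(rvec)       # rotation vector -> rotation matrix
    return K @ np.hstack([R, tvec])  # projection matrix P_n of the new view
```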
3.3 Shape from Stereo
Structure from motion as described above results in
a sparse reconstruction (point cloud) of an object.
We are, however, interested in a dense reconstruction.
Hence we use shape from stereo concepts to obtain
dense reconstructions from pairs of close views of the
image sequence. We then merge these 2-view recon-
structions using a volumetric fusion approach to ob-
tain the desired full reconstruction of an object.
Figure 2: The human head scene after adding seven addi-
tional views.
Figure 3: The full structure from motion reconstruction of
the human head scene. The structure (blue) is discarded
and the views (gray) are used for the following shape from
stereo step.
3.3.1 Obtaining 2-View Reconstructions
We obtain partial reconstructions, each from two views,
using a fast template matching algorithm. We
apply two preprocessing steps to reduce the complex-
ity of the matching problem and to improve the ro-
bustness of the dense stereo matcher. We reduce the
correspondence search space by one dimension us-
ing nonlinear image rectification (Oram, 2001). The
rectified images are then transferred from color space
to census space using the census transform (Zabih
and Woodfill, 1994). The census transform is a non-
parametric local transformation that relies on the rel-
ative ordering of local intensity values instead of the
intensity values themselves. Our own experiments and
a recent survey (Brown et al., 2003) indicate that
matching in census space increases the robustness
against lighting differences and occlusions.
Figure 4: Source images (top left) are rectified (top middle)
and the census transform is applied (top right). The green
line illustrates an epipolar line, which is equivalent to a
scanline after rectification. Using template matching, a dis-
parity map (bottom left) and a confidence map (bottom mid-
dle) are obtained. Finally, a triangle mesh is created (bottom
right).
We have
developed a template matching based dense stereo
matcher. We match in census space where the stan-
dard difference metrics like normalized cross correla-
tion cannot be used. Instead we use the Hamming
distance. Given the very high resolution of our in-
put images and the fact that we work on three color
channels, acceptable CPU complexity for the dense
matcher can only be achieved by avoiding redundant
computations. Hence we have implemented such a
very fast computation scheme, following (Stefano
and Mattoccia, 2000). The result-
ing disparity map contains errors, which are due to
texture-less areas, repetitive patterns and occlusions.
We enforce the uniqueness and the left-right stereo
constraints to identify and remove most of the erro-
neous areas in the disparity maps. As the integer ac-
curacy of the block matcher would lead to step artifacts
in the reconstruction and hence is not good enough for
a smooth reconstruction, we use a postprocessing step
to achieve subpixel accuracy (Frischholz, 1997). For
the volumetric fusion, we are also interested in the
confidence of the disparity values. Hence we define
the difference between the minimal and the mean
matching distance as a confidence metric.
Finally we obtain a triangle mesh from the disparity
map by means of triangulation. Figure 4 illustrates
the individual steps of our 2-view reconstruction al-
gorithm.
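To make the matching concrete, the following numpy sketch implements a census transform and a winner-takes-all disparity search with window-aggregated Hamming costs. It omits the redundancy-avoiding scheme, the uniqueness and left-right checks and the subpixel refinement; the window sizes are hypothetical. The per-pixel minimal cost is returned alongside the disparity so that a confidence value can be derived as described above.

```python
import cv2
import numpy as np

def census(gray, radius=3):
    """Census transform: per pixel, a bit string recording which neighbours
    in a (2r+1)x(2r+1) window are darker than the centre pixel (48 bits
    for radius 3, so it fits into a uint64)."""
    bits = np.zeros(gray.shape, dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neigh = np.roll(np.roll(gray, dy, axis=0), dx, axis=1)
            bits = (bits << np.uint64(1)) | (neigh < gray).astype(np.uint64)
    return bits

def hamming(a, b):
    """Per-pixel Hamming distance between two census images."""
    x = a ^ b  # fresh contiguous array, safe to reinterpret bytewise
    return np.unpackbits(x.view(np.uint8), axis=-1).reshape(*x.shape, 64).sum(-1)

def disparity(census_l, census_r, max_disp, win=9):
    """Winner-takes-all template matching on rectified census images."""
    H, W = census_l.shape
    cost = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    for d in range(max_disp):
        ham = hamming(census_l[:, d:], census_r[:, :W - d]).astype(np.float32)
        cost[d, :, d:] = cv2.blur(ham, (win, win))  # sum over the template
    return np.argmin(cost, axis=0), cost.min(axis=0)
```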
3.3.2 Volumetric Fusion
The 2-view reconstructions are now fused using a vol-
umetric approach. The triangle meshes are transferred
to volumetric functions, which can be fused more easily
and robustly. From the fused volumetric function a fi-
nal triangle mesh is then retrieved by means of iso-
surface extraction. Our approach is based on an exist-
ing approach (Curless and Levoy, 1996) but improves
the way partial reconstructions are converted to the
volumetric function representation.
Figure 5: The two red line segments are the 2D equivalents
of a patch of two triangles in 3D. Casting rays from the cam-
era center C through the limiting points of one line segment,
voxels in the area between these two rays need to be tested.
Because the signed distance function is defined to fall off at
a certain distance in front of and behind the surface, the area
can be further constrained to the light green area for the first
line segment and the yellow area for the second.
The volumetric function represents a surface by
mapping points in $\mathbb{R}^3$ to a vector in $\mathbb{R}^2$, representing
the signed distance of the point to the surface along
the line of sight of the camera and an associated
weight. Like Curless and
Levoy we use a discrete representation of this vol-
umetric function called a weighted signed distance
(WSD) grid, which is a regular volume of samples
of the continuous function. In theory the volumetric
function should extend indefinitely from the surface
in both directions. To prevent surfaces on opposite sides
of an object from interfering with each other, it is forced
to fall off within a certain distance of the surface. To con-
vert a triangle mesh to a WSD representation, for each
WSD grid voxel a sampling value of the WSD func-
tion needs to be found. Curless and Levoy propose a
scan-conversion process for the conversion. We have,
however, developed a both simple and efficient alter-
native to their approach: The WSD value for a voxel
can be found by casting a ray from the camera cen-
ter through the voxel and then intersecting it with the
triangle mesh. Our approach exploits the fact that the
WSD functions are only defined within a certain distance
of the surface. Hence we obtain WSD values only
for voxels that are close enough to the surface. Fig-
ure 5 explains this concept.
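The following sketch samples one partial WSD volume this way; intersect_ray_mesh is a hypothetical helper returning the distance to the first intersection of a ray with the 2-view triangle mesh (e.g. backed by a BVH), and trunc is the fall-off distance discussed above.

```python
import numpy as np

def sample_wsd(voxel_centers, cam_center, intersect_ray_mesh, trunc, conf=1.0):
    """Signed distance along the line of sight for each voxel near the
    surface. voxel_centers: (N, 3) array; cam_center: (3,) camera centre."""
    d = np.zeros(len(voxel_centers))
    w = np.zeros(len(voxel_centers))
    for i, v in enumerate(voxel_centers):
        ray = v - cam_center
        t_voxel = np.linalg.norm(ray)
        ray = ray / t_voxel
        t_surf = intersect_ray_mesh(cam_center, ray)  # hypothetical helper
        if t_surf is None:
            continue               # ray misses the mesh: voxel unobserved
        sd = t_surf - t_voxel      # positive in front of the surface
        if abs(sd) <= trunc:       # keep only samples in the fall-off band
            d[i] = sd
            w[i] = conf            # e.g. the stereo confidence of the region
    return d, w
```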
Partial reconstructions in WSD representation can
be merged easily and robustly using an additive scheme
proposed by Curless and Levoy. The cumulative func-
tions $D(x,y,z)$ and $W(x,y,z)$ of the merged WSD vol-
ume of $n$ partial volumes are defined as
$$D(x,y,z) = \frac{\sum_{i=1}^{n} w_i(x,y,z)\, d_i(x,y,z)}{\sum_{i=1}^{n} w_i(x,y,z)},$$
$$W(x,y,z) = \sum_{i=1}^{n} w_i(x,y,z).$$
Rewritten as an incremental calculation, the cu-
mulative signed distance and weight functions after
merging the $(i+1)$-th partial reconstruction read
$$D_{i+1}(x,y,z) = \frac{D_i(x,y,z)\, W_i(x,y,z) + w_{i+1}(x,y,z)\, d_{i+1}(x,y,z)}{W_i(x,y,z) + w_{i+1}(x,y,z)},$$
$$W_{i+1}(x,y,z) = W_i(x,y,z) + w_{i+1}(x,y,z).$$
Merging auxiliary WSD volumes into the global
WSD volume therefore is simply a matter of applying
the rules defined above to each voxel.
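Per voxel, this update is a one-liner in numpy; a minimal sketch, assuming a partial volume stores a weight of zero wherever its signed distance is undefined:

```python
import numpy as np

def merge(D, W, d_new, w_new):
    """Fold one partial WSD volume (d_new, w_new) into the cumulative
    volume (D, W) using the incremental rules above."""
    m = w_new > 0                       # only touch observed voxels
    denom = W[m] + w_new[m]
    D[m] = (D[m] * W[m] + d_new[m] * w_new[m]) / denom
    W[m] = denom
    return D, W
```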
A triangle mesh representation of the merged
WSD volume can be obtained at any time during
the sequential fusion of partial reconstructions from
the WSD volume using the marching cubes isosur-
face extraction algorithm (Lorensen and Cline, 1987).
The isosurface is defined as D(x, y, z) = 0. Therefore,
voxel configurations can directly be found by testing
the signs of the signed distance function D(x, y, z) at
the eight corners of a voxel. To skip empty space,
only voxels which have nonzero weights W(x, y, z) for
all eight corners are processed.
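With the volumes stored as 3d arrays, an off-the-shelf marching cubes implementation such as the one in scikit-image can extract this isosurface; its mask argument provides the empty-space skipping. This is a usage sketch, not the implementation used for our results:

```python
import numpy as np
from skimage import measure

# Extract the D(x, y, z) = 0 isosurface, skipping voxels without
# observed (nonzero-weight) samples.
verts, faces, normals, values = measure.marching_cubes(
    D, level=0.0, mask=(W > 0))
```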
3.4 Appearance Recovery
Recovering the shape of an object is not enough for
a visually convincing reconstruction. Besides the
shape, appearance information, like color and texture,
is very important. We recover appearance by means
of a simple texture mapping approach. We generate
both texture coordinates and the texture map triangle
by triangle: for each triangle all views in which its
projection is front facing and not occluded are found.
The triangle is then assigned to the view which en-
sures the highest possible texture resolution. Once
all triangles are assigned to views, rectangular tex-
ture snippets corresponding to the bounding boxes of
the projections of the triangles are obtained from the
views and added to a list sorted by the width of the
snippets. A texture map can now be populated with
the texture snippets for each triangle, by traversing
Figure 6: Cutout of a texture image populated column by
column with rectangular texture snippets for each shape tri-
angle. The full texture has a resolution of 2048× 2048 pix-
els.
Figure 7: Sample images of a 90 view image sequence of a
human head (top) and textured as well as untextured images
of the reconstruction results (bottom).
the sorted list and filling the image with rectangular
texture snippets column by column. Figure 3.4 shows
a cutout of a generated texture image. Finally, the po-
sitions of the texture snippets are normalized to [0, 1]
to match texture mapping standards and stored as tex-
ture coordinates, defining the mapping.
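A sketch of the per-triangle view selection described above: among the views in which the triangle is front-facing (the occlusion test is omitted here), the one with the largest projected area is chosen as a proxy for the achievable texture resolution. The list of 3x4 projection matrices with camera centres is a hypothetical input format.

```python
import numpy as np

def best_view(tri, views):
    """tri: (3, 3) world-space triangle vertices.
    views: list of (P, c) with P a 3x4 projection matrix, c the camera centre."""
    n = np.cross(tri[1] - tri[0], tri[2] - tri[0])  # triangle normal
    best, best_area = None, 0.0
    for i, (P, c) in enumerate(views):
        if np.dot(n, c - tri[0]) <= 0:  # back-facing in this view
            continue                    # (assumes consistent outward winding)
        h = P @ np.vstack([tri.T, np.ones(3)])  # project homogeneous vertices
        p = (h[:2] / h[2]).T                    # (3, 2) image-space vertices
        area = 0.5 * abs((p[1, 0] - p[0, 0]) * (p[2, 1] - p[0, 1])
                         - (p[1, 1] - p[0, 1]) * (p[2, 0] - p[0, 0]))
        if area > best_area:
            best, best_area = i, area
    return best  # index of the assigned view, or None if always back-facing
```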
4 EXPERIMENTS AND RESULTS
We have acquired and reconstructed more than 30 im-
age sequences. The full reconstruction of a 90 view
image sequence takes about two hours on a Pentium
IV machine. A selection of typical results is shown
in Figure 7, Figure 8 and Figure 9. While the recon-
struction results are visually convincing we are also
interested in an objective metric for the reconstruction
accuracy. This however requires ground truth refer-
ence data (for example acquired using a laser scanner)
which was not available to us. Hence we have at least
tried to determine the robustness of our approach. For
this purpose, we have acquired and reconstructed
three different image sequences of the same person.
The reconstruction results should be, and as Figure 10
shows indeed are, very similar.
Figure 8: Sample images of an 85 view image sequence of
a piece of wood (top) and textured as well as untextured
images of the reconstruction results (bottom).
Figure 9: Sample images of a 92 view image sequence of
a plastic dinosaur (top) and textured as well as untextured
images of the reconstruction results (bottom).
5 CONCLUSION
We have developed a passive 3d reconstruction ap-
proach based on structure from motion and shape
from stereo concepts. While the results are visually
convincing, we have not yet verified the reconstruc-
tion accuracy using ground truth data and an objec-
tive metric.
Figure 10: Reconstructions of three different image se-
quences of the same object show the robustness of our ap-
proach.
Furthermore, our reconstruction approach
suffers from several principal shape from stereo prob-
lems like untextured areas, occlusions and repetitive
patterns. Hence it cannot be used for the reconstruc-
tion of untextured or specular objects. To solve these
problems we are working on acquiring ground truth
data using artificial image sequences generated from
textured 3d models, and on combining our shape
from stereo based approach with shape from silhou-
ette concepts to improve the reconstruction quality of
sparsely-textured objects.
REFERENCES
Fitzgibbon, A. W., Cross, G., and Zisserman, A. (1998). Auto-
matic 3d model construction for turn-table sequences.
In 3D Structure from Multiple Images of Large-Scale
Environments: European Workshop, page 155.
Brown, D. (1976). The bundle adjustment - progress and
prospect. In XIII Congress of the ISPRS, Helsinki.
Brown, M., Burschka, D., and Hager, G. (2003). Advances
in computational stereo. In IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, volume 25,
pages 993–1008.
Curless, B. and Levoy, M. (1996). A volumetric method
for building complex models from range images. In
SIGGRAPH ’96: Proceedings of the 23rd annual con-
ference on Computer graphics and interactive tech-
niques, pages 303–312. ACM Press.
Favaro, P. and Soatto, S. (2002). Learning depth from de-
focus. In ECCV 2002 : 7th European Conference on
Computer Vision, page 735ff, Copenhagen, Denmark.
Frischholz, R. W. (1997). Beiträge zur automatischen dreidi-
mensionalen Bewegungsanalyse. PhD thesis, Univer-
sität Erlangen-Nürnberg.
Hartley, R. and Zisserman, A. (2003). Multiple View Geom-
etry in Computer Vision. Cambridge University Press.
Kutulakos, K. N. and Seitz, S. M. (1998). A theory of shape
by space carving. Technical Report TR692.
Laurentini, A. (1994). The visual hull concept for
silhouette-based image understanding. IEEE Trans.
Pattern Anal. Mach. Intell., 16(2):150–162.
Lorensen, W. E. and Cline, H. E. (1987). Marching cubes:
A high resolution 3d surface construction algorithm.
In SIGGRAPH ’87: Proceedings of the 14th an-
nual conference on Computer graphics and interac-
tive techniques, pages 163–169. ACM Press.
Oram, D. (2001). Rectification for any epipolar geometry.
In British Machine Vision Conference (BMVC), pages
653–662.
Rodehorst, V. (2004). Photogrammetrische 3D-
Rekonstruktion im Nahbereich durch Auto-
Kalibrierung mit projektiver Geometrie. PhD
thesis, TU Berlin.
Rupp, S., Elter, M., Breitung, M., Zink, W., and Küblbeck,
C. (2006). Robust Camera Calibration using Discrete
Optimization. Enformatika Transactions on Engineer-
ing, Computing and Science, 13:250 – 254.
Scharstein, D. and Szeliski, R. (2002). A taxonomy and
evaluation of dense two-frame stereo correspondence
algorithms. International Journal of Computer Vision,
47(1/2/3):7–42.
Schechner, Y. Y. and Kiryati, N. (2000). Depth from defo-
cus vs. stereo: How different really are they? Int. J.
Comput. Vision, 39(2):141–162.
Shi, J. and Tomasi, C. (1994). Good features to track. In
IEEE International Conference on Computer Vision
and Pattern Recognition, pages 593–600, Seattle.
Stefano, L. D. and Mattoccia, S. (2000). Fast stereo match-
ing for the videt system using a general purpose pro-
cessor with multimedia extensions. In CAMP ’00:
Proceedings of the Fifth IEEE International Workshop
on Computer Architectures for Machine Perception
(CAMP’00), page 356. IEEE Computer Society.
Tarini, M., Callieri, M., Montani, C., Rocchini, C., Ols-
son, K., and Persson, T. (2002). Marching intersec-
tions: An efficient approach to shape-from-silhouette.
In VMV, pages 283–290.
Tomasi, C. and Kanade, T. (1991). Detection and tracking
of point features. Technical Report CMU-CS-91-132,
Carnegie Mellon University.
Weik, S. (2000). A passive full body scanner using shape
from silhouettes. In International Conference on Pat-
tern Recognition, volume 1, pages 750 – 753.
Niem, W. and Wingbermühle, J. (1997). Automatic reconstruction
of 3d objects using a mobile monoscopic camera. In
International Conference on Recent Advances in 3-D
Digital Imaging and Modeling, pages 173 – 180.
Zabih, R. and Woodfill, J. (1994). Non-parametric local
transforms for computing visual correspondence. In
ECCV ’94: Proceedings of the third European confer-
ence on Computer Vision, volume 2, pages 151–158,
Stockholm, Sweden. Springer-Verlag New York, Inc.
Zhang, R., Tsai, P.-S., Cryer, J. E., and Shah, M. (1999).
Shape from shading: A survey. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
21(8):690–706.
Zhang, Z. (1998). A flexible new technique for camera cali-
bration. Technical Report MSR-TR-98-71, Microsoft
Corporation.
Ziou, D. (1998). Passive depth from defocus using a spatial
domain approach. In ICCV ’98: Proceedings of the
Sixth International Conference on Computer Vision,
page 799, Washington, DC, USA. IEEE Computer So-
ciety.