Combining Gesture and Voice Control for Mid-air Manipulation of CAD
Models in VR Environments
Markus Friedrich, Stefan Langer and Fabian Frey
Institute of Informatics, LMU Munich, Oettingenstraße 67, 80538 Munich, Germany
Keywords:
CAD Modeling, Virtual Reality, Speech Recognition, Constructive Solid Geometry.
Abstract:
Modeling 3D objects in domains like Computer Aided Design (CAD) is time-consuming and comes with a
steep learning curve needed to master the design process as well as tool complexities. In order to simplify the
modeling process, we designed and implemented a prototypical system that leverages the strengths of Virtual
Reality (VR) hand gesture recognition in combination with the expressiveness of a voice-based interface for
the task of 3D modeling. Furthermore, we use the Constructive Solid Geometry (CSG) tree representation
for 3D models within the VR environment to let the user manipulate objects from the ground up, giving an
intuitive understanding of how the underlying basic shapes connect. The system uses standard mid-air 3D
object manipulation techniques and adds a set of voice commands to help mitigate the deficiencies of current
hand gesture recognition techniques. A user study was conducted to evaluate the proposed prototype. Our hybrid input paradigm proves to be a promising step towards easier-to-use CAD modeling.
1 INTRODUCTION
Current Computer Aided Design (CAD) modeling
tools use mainly mouse control-based object manipu-
lation techniques in predefined views of the modelled
objects. However, manipulating 3D objects with a 2D
input device (mouse) on a 2D output device (moni-
tor) always necessitates a complex transfer, making it
difficult for beginners to grasp the needed interaction
concepts.
With the advent of affordable Virtual Reality (VR)
devices and robust hand gesture recognition systems,
more possibilities are at the fingertips of interaction
designers and researchers. The goal is to leverage the
potential of these immersive input methodologies to
improve intuitiveness, flatten the learning curve, and increase efficiency in the 3D modeling task.
However, hand gesture recognition systems face
many challenges in productive environments. Com-
pared to expert devices like 3D mice, these systems lack robustness as well as precision. Moreover, constant arm movements can cause fatigue syndromes like the Gorilla Arm Syndrome (LaValle, 2017), making it
hard to use systems over an extended period of time.
This is where we hypothesize that a hybrid approach,
combining gesture recognition and voice control, is
beneficial. While hands are used for intuitive manipulation of model parts, voice commands replace gestures that are hard to recognize robustly with words that are simple to say and easy to memorize.
In this work, we propose a novel interaction con-
cept, implemented as a prototype, which combines
hand gesture recognition and voice control for CAD
model manipulation in a VR environment. The pro-
totype uses gesture-based mid-air 3D object manipulation techniques where feasible and combines them with a set of voice commands to complement the in-
teraction concept. CAD models are represented as
a combination of geometric primitives and Boolean
set-operations (so-called Constructive Solid Geome-
try (CSG) trees) enabling transformation operations
that are more intuitive for beginners. The whole sys-
tem is evaluated in a user study, showing its potential.
The paper makes the following contributions:
- A new interaction concept for intuitive, CSG tree-based CAD modeling in VR that leverages the strengths of both gesture- and voice-based interactions.
- A prototypical implementation of the interaction concept with off-the-shelf hardware and software.
- A detailed user study demonstrating the advantages of the proposed approach.
The paper is structured as follows: Essential terms
are explained in Section 2. Related work is discussed
in Section 3 which focuses on mid-air manipulation
techniques. This is followed by an explanation of the
concept (Section 4) which is evaluated in Section 5.
Since the proposed prototype opens up a multitude of
different new implementation and research directions,
the paper concludes with a summary and a short de-
scription of possible future work (Section 6).
2 BACKGROUND
2.1 Construction Trees
Construction or CSG trees (Requicha, 1980) are a representation and modeling technique for geometric objects mostly used in CAD applications. Complex 3D
models are created by combining primitive geomet-
ric shapes (spheres, cylinders, convex polytopes) with
Boolean operators (union, intersection, difference,
complement) in a tree-like structure. The inner nodes
describe the Boolean operators while the leaf nodes
represent primitives. The CSG tree representation has
two big advantages over other 3D model representa-
tions: Firstly, it is memory-saving, and secondly, it is
intuitive to use.
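To make this structure concrete, the following minimal sketch models a CSG tree with Boolean operators as inner nodes and primitives as leaves. Python is used here purely for illustration (the prototype itself is built on Unity3D), and all class and field names are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Primitive:
    """Leaf node: a basic shape with its geometric parameters."""
    kind: str                                   # e.g. "sphere", "cylinder", "box"
    params: dict = field(default_factory=dict)  # e.g. {"radius": 0.5}

@dataclass
class Operation:
    """Inner node: a Boolean operator combining child sub-trees."""
    op: str                                     # "union", "intersection" or "difference"
    children: List[Union["Primitive", "Operation"]] = field(default_factory=list)

# A cylinder subtracted from a sphere, united with a box:
model = Operation("union", [
    Operation("difference", [
        Primitive("sphere", {"radius": 1.0}),
        Primitive("cylinder", {"radius": 0.3, "height": 2.0}),
    ]),
    Primitive("box", {"size": (1.0, 0.2, 1.0)}),
])
```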
2.2 Mid-air Interactions for 3D Object
Manipulation
The manipulation of 3D objects in a virtual scene is
a fundamental interaction in immersive virtual envi-
ronments. Manipulation is the task of changing char-
acteristics of a selected object using spatial transfor-
mations (Bowman and Hodges, 1999). Translation,
rotation, and scaling are referred to as basic manipu-
lation tasks (Bowman et al., 2004). Each task can be
performed in any axis direction. In general, a single
transformation in one specific axis is defined as a de-
gree of freedom (DOF). Thus, a system that provides
a translation-only interface on all three axes supports
three DOFs whereas a system that offers all three
manipulation tasks on all axes has 9 DOFs (Mendes
et al., 2019).
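As a concrete illustration of this counting (our own sketch, not taken from the cited works), a transform that exposes translation, rotation, and scaling independently per axis carries 3 x 3 = 9 degrees of freedom:

```python
from dataclasses import dataclass

@dataclass
class Transform9DOF:
    """Three manipulation tasks, each controllable per axis: 9 DOFs in total."""
    translation: tuple = (0.0, 0.0, 0.0)     # 3 DOFs (x, y, z offsets)
    rotation_euler: tuple = (0.0, 0.0, 0.0)  # 3 DOFs (one angle per axis)
    scale: tuple = (1.0, 1.0, 1.0)           # 3 DOFs (one factor per axis)
```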
In so-called mid-air interactions, inputs are passed
on through the user’s body, including posture or hand
gestures (Mendes et al., 2019). This kind of inter-
action provides a particularly immersive and natural
way to grab, move, or rotate objects in VR, which allows for more natural interfaces that can increase both usability and user performance (Caputo, 2019).
In this paper, the term mid-air manipulation refers
specifically to the application of basic transformations
to virtual objects in a 3D environment using hand ges-
tures. Apart from the allowed degrees of freedom, ex-
isting techniques can be classified by the existence of
a separation between translation, rotation, and scaling
(Mendes et al., 2019) which also affects manipula-
tion precision. Furthermore, two additional interac-
tion categories exist: Bimanual and unimanual. Bi-
manual interfaces imply that users need both hands
to perform the intended manipulation, whereas appli-
cations using unimanual interfaces can be controlled
with a single hand (Mendes et al., 2019).
3 RELATED WORK
Object Manipulation. The so-called Handle Box ap-
proach (Houde, 1992) is essentially a bounding box
around a selected object. The Handle Box consists
of a lifting handle, which moves the object up and
down, as well as four turning handles to rotate the ob-
ject around its central axis. There is no dedicated handle for horizontal movement; instead, the user simply clicks and slides the object.
In (Conner et al., 1992), so-called Virtual Han-
dles that allow full 9 DOF control are proposed. The
handles have a small sphere at their ends, which are
used to constrain geometric transformations to a sin-
gle plane or axis (Mendes et al., 2019). The user
selects the manipulation mode with the mouse but-
ton. During rotation, the initial user gesture is recog-
nized to determine the rotation axis. Both techniques were originally designed for classic mouse/screen interactions but are seamlessly transferable to hand-gesture/VR environments.
Mid-air Interactions. Caputo states that users often prefer mid-air interactions over other methods, finding the manipulation accuracy sufficient for many of the proposed tasks (Caputo, 2019). Early systems used gloves or other manual input devices that provided the functionality to grab and manipulate objects. In (Robinett and Holloway, 1992), a system
for translating, rotating, and scaling objects in VR is
proposed. The formalization of the object manipulation task into three sub-tasks, namely (object) selection, (object) transformation, and (object) release, was proposed in (Bowman and Hodges, 1999). One of the
most influential techniques in this field is the Hand-
centered Object Manipulation Extending Ray-casting
(HOMER) method (Bowman and Hodges, 1997). After the user selects an object with a light ray, the virtual hand moves to the object. Hand and object are then linked, and the user can manipulate the object with the virtual hand until the object is dropped. After that, the connection is removed and the hand returns to its natural position.
Metaphors for Manipulation. A bimanual 7 DOF
manipulation technique is the Handlebar introduced
by Song et al. (Song et al., 2012). It has been adopted
by multiple real-world applications. The Handlebar
uses the relative position and movement of both hands
to identify translation, rotation, and scaling, and to map the transformation onto a virtual object. The main
strength of this method is the intuitiveness due to the
physical familiarity with real-world actions like ro-
tating and stretching a bar. However, holding an arm
position may exhaust the user (Gorilla Arm Syndrome
(LaValle, 2017)).
In (Wang et al., 2011), manipulations (in particu-
lar translation and rotation) are separated. Translation
is applied by moving the closed hand and rotation by
using two hands around the three main axes, mimick-
ing the physical action of rotating a sheet of paper.
The most direct unimanual method is the Simple Vir-
tual Hand metaphor (Caputo, 2019) where objects are
selected by grasping them. After the selection, move-
ment and rotation of the user’s hand are mapped di-
rectly to the objects. This method is characterized by
its high intuitiveness. However, accurate, reliable and
robust hand tracking is required. Kim and Park extended this idea in (Kim and Park, 2014) by placing a sphere around the selected object and additionally enabling the user to control the reference frame during manipulation, which shows great improvements, in particular for rotation.
In order to demonstrate the advantages of DOF
separation, Mendes et al. applied the Virtual Han-
dles concept (Conner et al., 1992) to VR environ-
ments (Mendes et al., 2016). Since Virtual Handles allow users to choose a single axis, transformations along unwanted axes are unlikely. However, transformations along more than one axis take more time. We
use this concept for basic transformations in our pro-
totype (see Figure 4a).
Multimodal Object Manipulation. Chu et al. (Chu
et al., 1997) presented a multimodal interface for
CAD Systems in VR as early as 1997, concluding
that voice commands and hand gestures are supe-
rior to eye tracking. Zhong et al. (Zhong and Ma,
2004) propose an approach for constraint-based ma-
nipulations of CAD models in a VR environment.
The authors implemented a prototype for a three-level model comprising a high-level model for object definition, a mid-level, hybrid CSG and Boundary Representation model for object creation, and a polygon-based model for visualization and interaction. Zheng et al. (Zheng et al., 2000) also pro-
pose a VR-based CAD system for geometric model-
ing. In order to enhance the human-machine interaction, they add an electronic data glove as an additional input device for constructing, destroying, and free-form creation of objects. The authors claim that these interaction techniques are more suitable and intuitive for humans than traditional manipulation devices. Lee
et al. (Lee et al., 2013) conducted a usability study
of multimodal inputs in Augmented Reality environ-
ments. The authors compare speech-only and gesture-
only approaches to a hybrid solution and find that
the multimodal approach is a more satisfying inter-
action concept for users. Xue et al. (Xue et al., 2009) improve the voice-command-based input model for CAD software by making the interaction more natural through natural language processing. Kou et al. (Kou et al., 2010) also focus on natural language processing, utilizing a template-matching approach that enables the system to process unknown expressions.
4 CONCEPT
In order to simplify the modeling process as much
as possible, we utilized mid-air manipulation tech-
niques without any controllers or gloves - hand track-
ing only. The main feature of our technique is the
use of only the grasping gesture. This simple gesture
allows the usage of low-cost hand-tracking sensors,
more specifically the Leap Motion Controller sold by
Ultraleap. More complex instructions are issued via voice control. To the best of our knowledge, this is
the first approach that combines mid-air manipulation
techniques with voice commands in order to create an
intuitive and fast way of manipulating CSG-based 3D
objects in VR.
4.1 System Overview
The proposed system consists of hardware and soft-
ware components and is detailed in Figure 1. The
hand tracking controller (Leap Motion Controller)
is mounted on the VR headset which is an off-the-
shelf HTC Vive that uses laser-based head tracking
and comes with two 1080x1200 90Hz displays. The
speech audio signal is recorded using a wearable mi-
crophone (Headset). A central computing device
(Workstation) takes the input sensor data, processes it, applies the interaction logic, and renders the next frame, which is then displayed on the VR device's screens.
The software architecture consists of several sub-
components. The Gesture Recognition module trans-
lates input signals from the Leap Motion sensor into
precise hand poses and furthermore recognizes hand
gestures by assembling temporal chains of poses. The
Speech Recognition module takes the recorded speech
audio signal and translates it into a textual representa-
tion.
Figure 1: Architecture overview of the developed system. Hardware: an HTC Vive with a mounted Leap Motion Controller, a headset microphone, and a workstation exchanging head and hand poses, the speech audio signal, and the rendered scene image. Software: Gesture Recognition (Leap Motion Unity API), Speech Recognition (Microsoft Windows Speech Recognition), a Command Processor, a CSG Tree Parser, a Triangle Mesh Transformer, and the VR Engine & Interaction Logic (Unity3D, SteamVR).
The CSG model, which is later used for editing,
is parsed by the CSG Tree Parser from a JSON-based input format and transformed into a triangle
mesh by the Triangle Mesh Transformer. This step is
necessary since the used rendering engine (Unity3D)
is restricted to this particular geometry representa-
tion. The main component (VR Engine & Interaction Logic) reacts to recognized gestures and voice com-
mands, executes corresponding actions (Interaction
Logic), and handles VR scene rendering (Unity3D,
SteamVR). For example, if the user applies a basic
scaling transformation to a model part consisting of
a sphere, the geometric parameters of the sphere (in
this case its radius) are changed based on the hand
pose. Then, the model is re-transformed into the tri-
angle mesh representation, rendered, and the resulting
image is sent to the VR device.
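The following sketch illustrates this update flow. It is an assumption-level outline in Python rather than the prototype's actual Unity3D/C# code, and names such as apply_hand_pose and triangulate are hypothetical.

```python
def apply_hand_pose(primitive: dict, scale_offset: float) -> None:
    """Update the grabbed primitive's geometric parameters from the hand pose,
    e.g. grow or shrink a sphere's radius while a scale handle is held."""
    if primitive["kind"] == "sphere":
        primitive["params"]["radius"] = max(0.01, primitive["params"]["radius"] + scale_offset)

def triangulate(csg_tree: dict) -> list:
    """Placeholder for the Triangle Mesh Transformer: a real implementation would
    evaluate the CSG tree into a triangle mesh for the rendering engine."""
    return []

def on_frame(csg_tree: dict, grabbed_primitive: dict, scale_offset: float) -> list:
    """One interaction step: update parameters, re-mesh, hand off to the renderer."""
    apply_hand_pose(grabbed_primitive, scale_offset)
    return triangulate(csg_tree)  # rendered by Unity3D/SteamVR on the headset displays

sphere = {"kind": "sphere", "params": {"radius": 1.0}}
on_frame({"op": "union", "children": [sphere]}, sphere, scale_offset=0.05)
print(sphere["params"]["radius"])  # 1.05
```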
4.2 Interaction Concept
The interaction concept is based on hand gestures
(visible as two virtual gloves that visualize the user’s
hand and finger poses) and simple voice commands.
It furthermore consists of two tools, the model tool
and the tree tool, as well as the information board
serving as a user guidance system.
4.2.1 Model Tool
The model tool is used for direct model manipula-
tion and is active right after system startup. It can
be divided into two different input modes, the selection mode and the manipulation mode. Modes can be switched using the voice command 'select' (to selection mode) and one of the manipulation commands 'scale', 'translate', and 'rotate' (to manipulation mode). Figure 2 details the interaction concept
of the model tool.
The main reason for separating interactions this way is the ambiguity of hand gestures. In some cases, the recognition system was unable to distinguish between a model part selection and a translation or rotation transformation since both use the same grasp gesture. The grasp gesture was chosen since initial tests, which involved basic recognition tasks for all supported gestures, revealed that it is the most robustly recognized gesture available.
Figure 2: Interaction concept of the model tool. State diagram: in selection mode, a primitive is highlighted by moving the hand into it and can be selected ('append'), deselected ('remove'), grouped ('group'), or un-grouped ('un-group'); the commands 'translate', 'rotate', and 'scale' switch to manipulation mode, where virtual handles become visible and can be grabbed to perform transformations; 'select' switches back to selection mode.
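A minimal sketch of this voice-driven mode switching (our own illustration in Python, not the prototype's implementation; gesture handling and rendering are omitted) could look as follows:

```python
class ModelTool:
    MANIPULATION_COMMANDS = {"translate", "rotate", "scale"}
    SELECTION_COMMANDS = {"append", "remove", "group", "un-group"}

    def __init__(self):
        self.mode = "selection"      # the model tool starts in selection mode
        self.transformation = None   # active transformation while in manipulation mode
        self.selected = set()        # ids of currently selected primitives

    def on_voice_command(self, command: str, hovered_primitive=None):
        """Dispatch a recognized voice command according to the current mode."""
        if command == "select":
            self.mode, self.transformation = "selection", None
        elif command in self.MANIPULATION_COMMANDS:
            self.mode, self.transformation = "manipulation", command
        elif self.mode == "selection" and command in self.SELECTION_COMMANDS:
            self._apply_selection_command(command, hovered_primitive)

    def _apply_selection_command(self, command, hovered_primitive):
        if command == "append" and hovered_primitive is not None:
            self.selected.add(hovered_primitive)
        elif command == "remove" and hovered_primitive is not None:
            self.selected.discard(hovered_primitive)
        # "group" / "un-group" would merge the selection into a single group
        # or dissolve it again; omitted here for brevity.

tool = ModelTool()
tool.on_voice_command("append", hovered_primitive="sphere_1")  # selection mode
tool.on_voice_command("rotate")                                # switch to manipulation mode
```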
Selection Mode. The initial application state can
be seen in Figure 3 (a): The triangle mesh of the
loaded model is displayed in grey. The user can enter
the selection mode by saying the word 'select'. The
mode change is additionally highlighted through an
on-screen message. Using virtual hands, the user can
enter the volume of the model which highlights the
hovered primitives in green (Figure 3 (b)). This imi-
tates the hovering gesture well known from desktop-
based mouse interfaces. Once highlighted, the user
can append the primitive to the list of selected primi-
tives by using the voice command ’append’. Selected
primitives are rendered in red (Figure 3 (c)). This
way, multiple primitives can be selected. In order to remove a primitive from the selection, the user's hand must enter the primitive's volume and the voice command 'remove' must be used. Multiple selected primitives can
be grouped together with the voice command ’group’
which is useful in situations where the selected primi-
tives should behave like a single primitive during ma-
nipulation, e.g., when a rotation is applied. A group
is displayed in blue (see Figure 3 (d)) and can be dis-
solved by saying ’un-group’.
Figure 3: Illustration of all possible states of a primitive: (a) initial state, (b) highlighted, (c) selected, (d) grouped.
Manipulation Mode. For the manipulation of se-
lected or grouped primitives (in the following: a
model part), three basic transformations are avail-
able: translation, rotation, and scaling. The currently used transformation is selected via a voice command ('translate', 'rotate', and 'scale'). The manipulation
mode is entered automatically after invoking one of
these commands. The selected transformation is high-
lighted through the displayed Virtual Handles (Con-
ner et al., 1992) (three coordinate axes with a small
box at their ends, pointing in x-, y-, and z-direction as
shown in Figure 4a).
Rotation. Per-axis rotations are done by grabbing the
small boxes at the end of each coordinate axis and
performing a wrist rotation. The rotation is directly
applied to the corresponding model part axis. Alter-
natively, the sphere displayed at the center of the coor-
dinate system can be grabbed and rotated. The sphere
rotations are directly applied to the model part. This bypasses the restrictions of per-axis rotations and allows for faster manipulation at the cost of precision.
Translation. Per-axis translations work like per-axis
rotations. However, instead of rotating the wrist, the
grabbed boxes can be moved along the corresponding
axis which results in a direct translation of the model
part. The sphere in the center of the coordinate axis
can be grabbed and moved around without any restric-
tions. Resulting translations are directly applied to the
model part.
Scaling. Per-axis scaling works similarly to per-axis translation. The use of the sphere at the center of the coordinate system is not supported since it cannot be combined with a meaningful gesture for scaling.
The transformation to use can be switched in manipulation mode by invoking the aforementioned voice commands. If the primitive selection or group should be changed, a switch to the selection mode is necessary.
Figure 4: Virtual handles (a) and the tree tool with highlighting (b).
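The per-axis constraint behind the Virtual Handles can be sketched as follows (an illustrative Python snippet under our own assumptions, not the prototype's code): the hand displacement since grabbing a handle is projected onto that handle's axis, so only the chosen axis is affected.

```python
import numpy as np

AXES = {"x": np.array([1.0, 0.0, 0.0]),
        "y": np.array([0.0, 1.0, 0.0]),
        "z": np.array([0.0, 0.0, 1.0])}

def constrained_translation(grab_start, hand_pos, axis: str) -> np.ndarray:
    """Translation along a single axis, derived from the hand's movement since grabbing."""
    delta = np.asarray(hand_pos, dtype=float) - np.asarray(grab_start, dtype=float)
    direction = AXES[axis]
    return float(np.dot(delta, direction)) * direction

def constrained_scale(grab_start, hand_pos, axis: str, sensitivity: float = 1.0) -> np.ndarray:
    """Per-axis scale factors derived from movement along the handle's axis."""
    offset = constrained_translation(grab_start, hand_pos, axis)
    return np.ones(3) + sensitivity * offset

# The hand moved diagonally, but only the x component is applied:
print(constrained_translation((0, 0, 0), (0.4, 0.2, -0.1), "x"))  # [0.4 0.  0. ]
print(constrained_scale((0, 0, 0), (0.4, 0.2, -0.1), "x"))        # [1.4 1.  1. ]
```

Grabbing the central sphere corresponds to skipping the projection and applying the full hand motion, i.e. the axis-free translation or rotation described above.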
4.2.2 Tree Tool
The tree tool displays a representation of the model’s
CSG tree using small spheres as nodes. It appears
above the user’s left hand when the user holds it up-
wards. Each leaf node corresponds to a primitive,
the inner nodes represent the Boolean operators. Operation nodes have textures that depict the operation type (union, intersection, subtraction). The user can change the operation type of a node by grabbing the corresponding sphere and invoking one of the following voice commands ('change to union', 'change to inter', 'change to sub').
The tree tool also allows highlighting multiple prim-
itives at once by grabbing their parent node. Once
primitives are highlighted, their corresponding nodes
are displayed in green as well (see Figure 4b).
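As an illustration of the tree tool's two interactions (again a hedged Python sketch with an assumed data layout, not the prototype's code): a voice command rewrites the operator of the grabbed node, and grabbing a node collects all primitives in its sub-tree for highlighting.

```python
VOICE_TO_OPERATION = {
    "change to union": "union",
    "change to inter": "intersection",
    "change to sub":   "subtraction",
}

def change_operation(grabbed_node: dict, command: str) -> None:
    """Replace the Boolean operator of a grabbed inner node."""
    if command in VOICE_TO_OPERATION and "op" in grabbed_node:
        grabbed_node["op"] = VOICE_TO_OPERATION[command]

def primitives_below(node: dict) -> list:
    """Collect all leaf primitives under a grabbed node so they can be highlighted together."""
    if "primitive" in node:
        return [node]
    return [p for child in node.get("children", []) for p in primitives_below(child)]

tree = {"op": "union", "children": [
    {"primitive": "sphere"},
    {"op": "difference", "children": [{"primitive": "box"}, {"primitive": "cylinder"}]},
]}

change_operation(tree, "change to sub")
print(tree["op"])                   # subtraction
print(len(primitives_below(tree)))  # 3
```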
4.2.3 Information Board
The Information Board depicts the current state of the
application, the manipulation task, and all voice com-
mands including their explanations. The board is al-
ways visible as a billboard and helps the user to mem-
orize voice commands and to be aware of the current
interaction mode.
5 EVALUATION
A usability study was conducted in order to evaluate
the proposed interaction concept. Its goal was to validate whether the concept is easy to understand, even for novices in the fields of VR and CAD, and whether the combination of hand gesture and voice control is perceived as a promising approach to intuitive CAD modeling.
5.1 Participants
Five student volunteers, three females and two males, participated in the study. Two of them had never used VR headsets before. Four participants have a background in Computer Science, and one participant is a student of Molecular Life Science. Details about the participants
are depicted in Table 1.
Table 1: Study participants. Abbreviations: Background
in Computer Science (CS), User Experience Design (UXD)
and Molecular Life Science (MLS). Experience in Com-
puter Aided Design (CAD) or Virtual Reality (VR).
Gender Age Background VR CAD
male 30-40 UXD Yes No
female 20-30 CS Yes Yes
female 20-30 MLS No No
female 20-30 CS No No
male 20-30 CS No No
5.2 User Study Setup
Participants could move around freely within an area of two square meters. The overall study setting is shown in Figure 5.
Figure 5: Participants conducting the user study.
As an introduction, the participants were shown a simple CAD object (Object 1 in
Figure 6) together with the information board to fa-
miliarize themselves with the VR environment. Up-
coming questions were answered directly by the study
director.
After the introduction, the participants were asked
to perform selection and modification tasks (see Table
2 for the list of tasks) where they were encouraged to
share their thoughts and feelings using a Thinking-
Aloud methodology (Boren and Ramey, 2000). The
tasks were picked such that all possible interactions
with the prototype were covered. After completing
these tasks, two more objects (Objects 2 and 3 in Fig-
ure 6) were shown to the participants, and they were
asked to do the same tasks as with Object 1. The
whole process was recorded on video for further re-
search and evaluation purposes.
The participants’ actions, comments and experi-
enced difficulties were logged in order to distill the
advantages and disadvantages of the proposed inter-
action concept. In addition, problems that came up
during the test could be directly addressed and dis-
cussed.
Table 2: Tasks for the participants.
Tasks
Try to select the circle and the cube together.
Deselect the selected primitives.
Select the same primitives using the object tree.
Rotate a primitive by 90 degrees.
Scale a primitive.
Move a cylinder.
Change an operator from union to subtraction.
Figure 6: User study objects (Friedrich et al., 2019): (a) Object 1, (b) Object 2, (c) Object 3.
5.3 Interviews
After finishing the aforementioned tasks, an interview
was conducted in order to gain further insight into the
participants’ user experience. Each interview started
with general questions about pre-existing knowledge
of VR and CAD software, which should help with the classification of the given answers (see Table 3).
Table 3: Selected questions for the interview.
Questions
How much experience do you have in virtual reality applications?
How much experience do you have in CAD/3D modeling software?
How do you see our control method compared to a mouse-based interface?
What is your opinion on pointer devices?
Statements were assembled based on the catalogue of questions in (Bossavit et al., 2014) (see Table 4) and served as an opener for discussion. Interviews were
digitally recorded and then transcribed.
5.4 Results
All results discussed in the following were derived
from the interview transcriptions and video recordings.
Table 4: Statements given to initiate discussion.
Statements
I found the technique easy to understand.
I found the technique easy to learn.
I found it easy to select an object.
I found it easy to group multiple objects.
I found translation easy to do.
I found rotation easy to do.
I found scaling easy to do.
I found the tree tool easy to understand.
The tree tool was useful to understand the structure of the object.
The change of the Boolean operators was easy to understand.
In general, the interaction design was well re-
ceived by the participants even in this early stage of
prototype development. All users stated that they un-
derstood the basic interaction idea and all features of
the application.
However, it was observed that users had to get
used to the separation between selection and ma-
nipulation mode which was not entirely intuitive to
them. For instance, some participants tried to high-
light primitives in the edit mode. In this case, addi-
tional guidance was needed. All users stated that the
colorization of objects is an important indication to
determine the state of the application and which ac-
tions can be executed.
The information board was perceived as being very
helpful but was actively used by only three partici-
pants. Two users struggled with the information board
not always being in the field of view which necessi-
tated a turn of the head to see it. Only one user re-
alized that the current state of the application is dis-
played there as well.
Hand gesture-based model manipulation was very
well received by the users. The hover and grab ges-
tures did not need any explanation. Four users had
trouble using their hands due to sensor range issues.
The voice control was evaluated positively. However,
participants reported that the command recognition
did not always work. The overall success rate was
70.7% (see Table 5). Interestingly, both male participants had noticeably higher success rates than the three female participants, which might be due to the different pitches of their voices.
In addition, some command words like ’append’
confused the users, because they expected a new
primitive to be appended to the object. This indicates
that an intuitive voice command design is essential
but, at the same time, hard to achieve.
The manipulation technique including the Virtual
Handles was also evaluated positively. All users high-
lighted the intuitiveness of translation and scaling in
particular. Two participants missed the functionality
of uniform scaling.
Table 5: Overview of voice control success rates (success rate = recognized / (recognized + not recognized)). Success means that a command was successfully recognized by the system.
User  Recognized  Not recog.  Success Rate
P1    60          12          83.3%
P2    52          35          59.8%
P3    49          23          68.1%
P4    55          27          67.1%
P5    36          12          75.0%
All   252         109         70.7%
One participant needed an extra
explanation in order to understand the intention of the
sphere in the center (axis-free object manipulation).
The rotation was not fully transparent to the participants. All of them mentioned that they did not immediately recognize the selected rotation axis and, instead of rotating the cubes at the end of the handle, tried to move the complete handle.
The tree tool was evaluated positively. All partic-
ipants used it to understand and manipulate the struc-
ture of the object. One participant liked that it was easy to use and always accessible. Two participants
rated it as an intuitive tool without having an in-depth
knowledge of what exactly a CSG tree is. However,
one participant could not directly see the connection
between the spheres in the tree and the corresponding
primitives. Thus, we could observe users hovering over random nodes to see the assigned primitive when trying to select specific primitives. Three users had difficulties
in understanding the Boolean set-operations.
The participants showed different preferences for highlighting primitives. Two users preferred the tree tool while the others favored the hover gesture-based approach. Selection as well as de-selection was rated positively, mostly for being simple to use. However, when dealing with more complex objects, selecting the desired primitive was perceived as more difficult. This was due to the smaller size of the elements and the smaller distance to neighboring primitives, which makes hovering more difficult. It was
also observed that in unfavorable cases, the selection
hand covered the left hand so that the sensor could
no longer detect its palm. As a result, the tree disap-
peared. This made it more difficult for users to select
a sphere in the tree.
5.5 Discussion
5.5.1 Limitations
The accuracy of hand tracking is very important.
Even though only very simple gestures need to be de-
tected, tracking problems occurred repeatedly for no
specific or obvious reason. This led to problems like
unwanted transformations, especially when the end of
a gesture was not detected correctly. However, the
study participants were able to gain comprehensive insights into the application's interaction flow.
Furthermore, the group of participants was rather
small and none of them had any experience with ex-
isting mid-air interaction techniques. Future versions
of the prototype should be tested with more users,
also taking different target user groups into account (CAD modelling experts and beginners, ...). In addition, the tasks should also be performed using standard CAD modeling software for comparison.
5.5.2 Findings
The reported flat learning curve shows the potential
of the proposed approach. After just a few explana-
tions and a few minutes of training, users were able
to modify 3D objects, even those participants without
VR or CAD modelling experience.
The Virtual Handle approach is a suitable and in-
tuitive three DOF manipulation tool which can be
used for translation, rotation, and scaling in virtual
environments. For those actions, our prototype needs
to be improved, since the implementation of the rotation transformation was not well received.
Using additional voice commands for interacting
with the system is a good way to simplify the graphi-
cal user interface. However, voice interfaces must be
designed carefully in order to be intuitive while com-
mand recognition must work flawlessly.
The examination and manipulation of objects us-
ing the CSG tree structure works well even for
novices. The tree tool can be understood and used
without having any knowledge about the theory of
CSG trees. However, the system quickly reaches its
limits if the object consists of many small primitives.
Furthermore, some users reported that they had diffi-
culties imagining intersection and subtraction opera-
tions in advance. In addition, the more complex the
objects become, the more difficult it is to manipulate
them using the proposed method.
6 CONCLUSION
In this work, a novel, multi-modal mid-air manipu-
lation technique for CSG-based 3D objects was pre-
sented. To make the interaction as intuitive as possi-
ble, the developed prototype is based on three control
elements: Unimanual gesture controls via the demon-
strated Virtual Handles, voice control and a virtual
widget we named ’the tree tool’ for direct CSG tree
manipulation. We conducted a qualitative user study
which showed that users learn and understand our
manipulation method very quickly. For future work,
we would like to add missing functionality (e.g., for
adding new primitives), evaluate our approach by
conducting quantitative and comparative user studies,
and extend the voice command concept more in the
direction of natural language dialogue systems.
REFERENCES
Boren, T. and Ramey, J. (2000). Thinking aloud: reconcil-
ing theory and practice. IEEE Transactions on Profes-
sional Communication, 43(3):261–278.
Bossavit, B., Marzo, A., Ardaiz, O., De Cerio, L. D., and
Pina, A. (2014). Design choices and their implications
for 3d mid-air manipulation techniques. Presence:
Teleoperators and Virtual Environments, 23(4):377–
392.
Bowman, D., Kruijff, E., LaViola Jr, J. J., and Poupyrev,
I. P. (2004). 3D User interfaces: theory and practice,
CourseSmart eTextbook. Addison-Wesley.
Bowman, D. A. and Hodges, L. F. (1997). An evaluation of
techniques for grabbing and manipulating remote ob-
jects in immersive virtual environments. In Proceed-
ings of the 1997 symposium on Interactive 3D graph-
ics, pages 35–ff.
Bowman, D. A. and Hodges, L. F. (1999). Formalizing the
design, evaluation, and application of interaction tech-
niques for immersive virtual environments. Journal of
Visual Languages & Computing, 10(1):37–53.
Caputo, F. M. (2019). Gestural interaction in virtual envi-
ronments: User studies and applications. PhD thesis,
University of Verona.
Chu, C.-C., Dani, T. H., and Gadh, R. (1997). Multimodal
interface for a virtual reality based computer aided de-
sign system. In Proceedings of International Confer-
ence on Robotics and Automation, volume 2, pages
1329–1334. IEEE.
Conner, B. D., Snibbe, S. S., Herndon, K. P., Robbins,
D. C., Zeleznik, R. C., and Van Dam, A. (1992).
Three-dimensional widgets. In Proceedings of the
1992 symposium on Interactive 3D graphics, pages
183–188.
Friedrich, M., Fayolle, P.-A., Gabor, T., and Linnhoff-
Popien, C. (2019). Optimizing evolutionary csg tree
extraction. In Proceedings of the Genetic and Evolu-
tionary Computation Conference, GECCO ’19, page
1183–1191, New York, NY, USA. Association for
Computing Machinery.
Houde, S. (1992). Iterative design of an interface for easy 3-
d direct manipulation. In Proceedings of the SIGCHI
conference on Human factors in computing systems,
pages 135–142.
Kim, T. and Park, J. (2014). 3d object manipulation using
virtual handles with a grabbing metaphor. IEEE com-
puter graphics and applications, 34(3):30–38.
Kou, X., Xue, S., and Tan, S. (2010). Knowledge-guided
inference for voice-enabled cad. Computer-Aided De-
sign, 42(6):545–557.
LaValle, S. (2017). Virtual Reality, volume 1. Cambridge
University Press.
Lee, M., Billinghurst, M., Baek, W., Green, R., and Woo,
W. (2013). A usability study of multimodal input in
an augmented reality environment. Virtual Reality,
17(4):293–305.
Mendes, D., Caputo, F. M., Giachetti, A., Ferreira, A., and
Jorge, J. (2019). A survey on 3d virtual object manip-
ulation: From the desktop to immersive virtual envi-
ronments. In Computer graphics forum, volume 38,
pages 21–45. Wiley Online Library.
Mendes, D., Relvas, F., Ferreira, A., and Jorge, J. (2016).
The benefits of dof separation in mid-air 3d object
manipulation. In Proceedings of the 22nd ACM Con-
ference on Virtual Reality Software and Technology,
pages 261–268.
Requicha, A. G. (1980). Representations for rigid solids:
Theory, methods, and systems. ACM Comput. Surv.,
12(4):437–464.
Robinett, W. and Holloway, R. (1992). Implementation
of flying, scaling and grabbing in virtual worlds. In
Proceedings of the 1992 symposium on Interactive 3D
graphics, pages 189–192.
Song, P., Goh, W. B., Hutama, W., Fu, C.-W., and Liu, X.
(2012). A handle bar metaphor for virtual object ma-
nipulation with mid-air interaction. In Proceedings of
the SIGCHI Conference on Human Factors in Com-
puting Systems, pages 1297–1306.
Wang, R., Paris, S., and Popović, J. (2011). 6d hands: markerless hand-tracking for computer aided design. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 549–558.
Xue, S., Kou, X., and Tan, S. (2009). Natural voice-enabled
cad: modeling via natural discourse. Computer-Aided
Design and Applications, 6(1):125–136.
Zheng, J., Chan, K., and Gibson, I. (2000). A vr-based
cad system. In International Working Conference on
the Design of Information Infrastructure Systems for
Manufacturing, pages 264–274. Springer.
Zhong, Y. and Ma, W. (2004). An approach for solid
modelling in a virtual reality environment. In Virtual
and Augmented Reality Applications in Manufactur-
ing, pages 15–42. Springer.