The VVAD-LRS3 Dataset for Visual Voice Activity Detection
Adrian Lubitz (1), Matias Valdenegro-Toro (2) and Frank Kirchner (1,3)
(1) Department of Computer Science, University of Bremen, 28359 Bremen, Germany
(2) Department of AI, University of Groningen, 9747 AG Groningen, The Netherlands
(3) Robotics Innovation Center, German Research Center for Artificial Intelligence, Bremen, Germany
ORCID: https://orcid.org/0000-0003-0609-2850, https://orcid.org/0000-0001-5793-9498, https://orcid.org/0000-0002-1713-9784
Keywords: Human-Robot Interaction, Perception, Dataset, Deep Learning.
Abstract: Robots are becoming everyday devices, increasing their interaction with humans. To make human-machine interaction more natural, cognitive features like Visual Voice Activity Detection (VVAD), which detects whether a person is speaking or not given the visual input of a camera, need to be implemented. Neural networks are state of the art for tasks in image processing, time series prediction, natural language processing and other domains. These networks require large quantities of labeled data, and currently there are not many datasets for the task of VVAD. In this work we created a large scale dataset called the VVAD-LRS3 dataset, derived through automatic annotations from the LRS3 dataset. The VVAD-LRS3 dataset contains over 44K samples, over three times the size of the next competitive dataset (WildVVAD). We evaluate different baselines on four kinds of features: facial and lip images, and facial and lip landmark features. With a Convolutional Neural Network Long Short Term Memory (CNN LSTM) on facial images, an accuracy of 92% was reached on the test set. A study with humans showed that they reach an accuracy of 87.93% on the same test set.
1 INTRODUCTION
Technology is integrating more and more into everyday life. A very important question is how people interact with technology. The human brain does not react emotionally to artificial objects like computers and mobile phones, but it reacts strongly to human appearances such as the shape of the human body or faces (gun Choi and Kim, 2009). Humanoid robots are therefore the most natural medium for human-machine interaction, because of their human-like appearance. This hypothesis is strongly supported by HRI research from (Kanda and Ishiguro, 2017), (Ángel Pascual del Pobil Ferré et al., 2013), (OZTOP et al., 2005) and (Miwa et al., 2003), who see social robots as a part of the future society. (Kanda and Ishiguro, 2017) also define the following three issues which need to be solved to bring social robots effectively and safely into everyday life:
a. Sensor network for tracking robots and people.
b. Development of humanoids that can work in the daily environment.
c. Development of functions for interactions with people.
This paper addresses issue (c), as we propose
a large scale dataset to train models for the task of
Visual Voice Activity Detection (VVAD) which detects
whether a person is speaking to a robot or not, given
the visual input of the robot’s camera.
VVAD is an important cognitive feature in Human-Robot Interaction (HRI). As we want robots to integrate seamlessly into our society, Human-Robot Interaction needs to be as close as possible to Human-Human Interaction (HHI). VVAD can be used for speaker detection in the case where multiple people are in the robot's field of view. Furthermore it can be useful for detecting directed speech in noisy environments.
In this paper we present a new benchmark for the VVAD task, produced from the LRS3 dataset (Triantafyllos Afouras, 2018), which contains TED Talks; using the provided textual transcripts, we extract parts of the TED Talk videos in order to generate positive and negative video samples for the VVAD task. Our dataset contains 37.6K training and 6.6K validation samples, making it currently the largest VVAD dataset. The dataset will be publicly available on the internet. We provide baseline models using commonly used neural network architectures.
Figure 1: Example of a detection error: the person is classified as having mouth activity but does not speak (Meriem Bendris and Chollet, 2010).
In an experimental setup with a CNN LSTM, an accuracy of 92% was reached on the test set. A study with humans showed that humans reach an accuracy of 87.93% on the same test set.
This paper contributes a large scale dataset and a simple approach to using it for VVAD.
2 RELATED WORK
The classic approach to solve VVAD is to detect lip motion. This approach is taken by F. Luthon and M. Liévin in (F. Luthon, 1998). They model the motion of the mouth in a sequence of color images with Markov Random Fields. For the lip detection they analyze the images in the HIS (Hue, Intensity, Saturation) color space and extract regions with a prevailing close-to-red hue, which leads to a robust, lighting-independent lip detection. A different approach was taken by Spyridon Siatras, Nikos Nikolaidis, and Ioannis Pitas in (Spyridon Siatras and Pitas, 2006). They convert the problem of lip motion detection into a signal detection problem. They measure the intensity of pixels in the mouth region and classify with a threshold, arguing that frames with an open mouth have a substantially higher number of pixels with low intensity. In (Meriem Bendris and Chollet, 2010) Meriem Bendris, Delphine Charlet and Gérard Chollet propose a method which measures the probability of voice activity with the optical flow of pixels in the mouth region. In (Meriem Bendris and Chollet, 2010) the drawback of approaches based on lip motion detection is already discussed. As shown in Figure 1, what makes the problem difficult is that people move their lips from time to time although they are not speaking.
This issue is tackled by Foteini Patrona, Alexandros Iosifidis et al. (Patrona et al., 2016). They use a Space Time Interest Point (STIP) or Dense Trajectory-based facial video representation to train a Single Hidden Layer Feedforward Neural Network. The features are generated from the CUAVE dataset (Patterson et al., 2002). This removes the implicit assumption (of the approaches above) that lip motion equals voice activity. A more robust approach, which uses Centroid Distance Features of the normalized lip shape to train an LSTM Recurrent Neural Network, is proposed by Zaw Htet Aung and Panrasee Ritthipravat in (Aung and Ritthipravat, 2016). This method shows a classification accuracy of up to 98% on a relatively small dataset. In conclusion, all of the mentioned methods use some kind of face detection and some also use mechanisms to track the face, which is needed if there is more than one face in the image. From the facial images, features are created in different ways. From that point the approaches divide into two branches. The first and naive approach is to assume that lip motion equals speech. This is obviously not always the case, which is why later approaches do not rely on this hypothesis. The latter approach uses learning algorithms to learn the real mapping between facial images and speech/no speech. This approach relies strongly on a balanced dataset to learn a well performing model.
While datasets like the LRS3 or CUAVE (Patterson et al., 2002) provide a good fit for lipreading, they lack the negative class for VVAD. There are not many datasets for the VVAD task. The only competitive state of the art dataset for VVAD that we found was WildVVAD (Guy et al., 2020). WildVVAD is not only three times smaller than the VVAD-LRS3, it is also more prone to false positives and false negatives because of the loose assumption that detected voice activity together with a single face in the video equals a speaking sample, and that every detected face in a video sequence without voice activity is a not speaking sample. Furthermore, the source WildVVAD is drawn from makes it less diverse. Table 1 shows a comparison of state of the art datasets. The VVAD-LRS3 that we propose in this paper is 3× larger than WildVVAD.
3 DATASET CAPTURE
To create the large scale VVAD dataset we took the
Lip Reading Sentences 3 (LRS3) Dataset introduced
by Afouras et al. in (Triantafyllos Afouras, 2018) as a
basis. The LRS3 is a dataset designed for visual speech
recognition and is created from videos of 5594 TED
and TEDx talks. It provides more than 400 hours of video material of natural speech. The LRS3 dataset provides
videos along with metadata about the face position and
a speech transcript. In the LRS3 metadata files, the following fields are important for the transformation to the VVAD dataset (a minimal parsing sketch is given after the field descriptions):
Table 1: Overview of state of the art datasets for VVAD.
Dataset Samples Diversity Pos/Neg Ratio
VVAD-LRS3 (this work) 44,489 Very high 1-to-1
WildVVAD (Guy et al., 2020) 13,000 High 1-to-1
LRS3 (Triantafyllos Afouras, 2018) >100,000 Very high 1-to-0
CUAVE (Patterson et al., 2002) 7,000 Low 1-to-0
Table 2: Number of samples for training, validation and test
splits of the VVAD-LRS3 dataset.
Training Set Validation Set Test Set
37,646 Samples 6,643 Samples 200 Samples
Text. Contains the text for one sample. The length of the text, and accordingly of the sample, is defined by the length of the scene; that means one sample can be as long as the face is present in the video.
Ref. The reference to the corresponding YouTube video. The value of this field needs to be appended to https://www.youtube.com/watch?v= to obtain the video URL.
FRAME. Corresponds to the face bounding box for every frame, where FRAME is the frame number, X and Y give the position of the bounding box in the video, and W and H are the width and height of the bounding box respectively. Note that a frame rate of 25 fps is assumed for the frame numbers, and that X, Y, W and H are given relative to the width and height of the video.
WORD. Maps a timing to every spoken word, where START and END indicate the start and end of the word in seconds. Note that these times are relative to the start of the sample given by the first frame, not to the start of the whole video.
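As a minimal illustration of how these fields can be read, the sketch below parses a metadata file of this form. It assumes a plain-text layout in which a Text: line carries the transcript, a Ref: line carries the YouTube identifier, and the FRAME and WORD tables follow their header lines as whitespace-separated rows; the exact layout of the LRS3 files may differ, and the function name is only illustrative.

# Minimal sketch: parse the metadata fields described above. The assumed
# layout ("Text:"/"Ref:" lines, then "FRAME ..." and "WORD ..." tables) is an
# approximation of the LRS3 files, not their exact specification.
def parse_metadata(path):
    text, ref = "", ""
    frames, words = [], []          # (frame, x, y, w, h) and (word, start, end)
    section = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Text:"):
                text = line[len("Text:"):].strip()
            elif line.startswith("Ref:"):
                ref = line[len("Ref:"):].strip()
            elif line.startswith("FRAME"):
                section = "frame"               # bounding box table follows
            elif line.startswith("WORD"):
                section = "word"                # word timing table follows
            elif line and section == "frame":
                cols = line.split()
                frames.append((int(cols[0]), *map(float, cols[1:5])))
            elif line and section == "word":
                cols = line.split()
                words.append((cols[0], float(cols[1]), float(cols[2])))
    url = "https://www.youtube.com/watch?v=" + ref
    return text, url, frames, words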
The LRS3 dataset comes with a low bias towards specific ethnic groups, because TED and TEDx talks are international and are held by men and women as well as children. It also has the advantage that it depicts a large variety of people, because the likelihood of a person speaking in multiple TED or TEDx talks is rather small. This is a big advantage over the LRS2 and LRW datasets, which are extracted from regular TV shows and therefore bring the risk of overfitting to specific persons. LRS3 makes learning more robust in that sense. Since natural speech in front of an audience includes pauses for applause and means to structure and control a speech, as described in (Nikitina, 2011), the LRS3 dataset provides speaking and not speaking phases.
To transform LRS3 samples into VVAD samples, the given text files are analyzed for these speaking and not speaking phases. In (Zellner, 1994) Brigitte Zellner shows that pauses occur in natural speech and explicitly in speech in front of an audience. This leads to two constants we need to define in the context of pauses. The first is maxPauseLength, which defines the maximal length of a pause that is still considered to be an inter-speech pause. In consideration of the different types of pauses mentioned in (Zellner, 1994), maxPauseLength is set to 1 s. The second constant is sampleLength, which defines the length of a sample. In other words, this defines how long a pause has to be to be considered a negative (not speaking) sample, or how long a speech phase needs to be to be considered a positive (speaking) sample. The analysis shows that most of the pauses have a length between 1.5 s and 2.5 s; therefore sampleLength is set to 1.5 s to get the most out of the LRS3 dataset. The extraction of positive and negative samples for the VVAD starts purely on a textual basis. Theoretically the whole extraction of the data could work on this basis, but the given bounding boxes were very poor. To overcome this problem, face detection and tracking was redone using dlib's (King, 2009) correlation-filter-based tracker and face detector.
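The segmentation into positive and negative samples can be sketched as follows; this is a simplified illustration of the rule described above (the constant names follow the text, the word list is assumed to be in (word, start, end) form, and details of the actual pipeline may differ).

MAX_PAUSE_LENGTH = 1.0   # seconds; a longer gap ends the current speech phase
SAMPLE_LENGTH = 1.5      # seconds; minimum length of a positive or negative sample

def segment_words(words):
    """Split a word timing list [(word, start, end), ...] of one LRS3 sample
    into speaking and not-speaking intervals as described in the text."""
    positives, negatives = [], []                 # (start, end) intervals
    phase_start, phase_end = words[0][1], words[0][2]
    for _, start, end in words[1:]:
        pause = start - phase_end
        if pause <= MAX_PAUSE_LENGTH:
            phase_end = end                       # still the same speech phase
        else:
            if phase_end - phase_start >= SAMPLE_LENGTH:
                positives.append((phase_start, phase_end))
            if pause >= SAMPLE_LENGTH:            # long pause -> negative sample
                negatives.append((phase_end, start))
            phase_start, phase_end = start, end
    if phase_end - phase_start >= SAMPLE_LENGTH:
        positives.append((phase_start, phase_end))
    return positives, negatives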
We provide four kinds of features derived from the tracked face and the facial landmarks:
Face Images. The whole image resized and zero
padded to a specific size.
Lip Images. An image of only the lips resized and
zero padded to a specific size.
Face Features. All 68 facial landmarks extracted with dlib's facial landmark detector.
Lip Features. All facial landmarks concerning the lips, extracted with dlib's facial landmark detector.
For the face images, the input image only needs to be resized and zero padded to a given size. As depicted in Figure 2, the predictor extracts the facial shape given by 68 landmarks, 20 of which describe the lips. The predictor is trained on the ibug 300-W face landmark dataset (available at https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/).
(a) Face Images feature (b) Lip Images feature
(c) Face Features feature (d) Lip Features feature
Figure 2: Visualization of one frame of different features.
Table 3: Dimensionality of the different features, with ts as the number of timesteps and d as the dimensions per timestep.
ts d dtype
Face Images 38 200 × 200 × 3 uint8
Lip Images 38 100 × 50 × 3 uint8
Face Features 38 68 × 2 float64
Lip Features 38 20 × 2 float64
For the lip images, the minimal x- and y-values of the lip landmarks are taken as the upper left corner of the lip image, while the lower right corner is defined by the maximal x- and y-values. The face features are taken directly from the landmarks given by dlib's facial landmark detector. For the lip features, only landmarks 49 to 68 are taken into account, because they fully describe the lip shape as seen in Figure 2. Note that it is useful to normalize the face features and lip features before applying them to a learning algorithm.
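A per-frame extraction of these features could look roughly as follows. The sketch uses dlib's standard frontal face detector and 68-point shape predictor (the model file name is the one commonly distributed with dlib); resizing, zero padding and the exact normalization used in the pipeline are omitted, and the function name is illustrative.

# Sketch: extract face landmarks, lip landmarks and a lip crop for one RGB
# frame with dlib (King, 2009). Assumes exactly one visible face per frame.
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def frame_features(image):
    face = detector(image, 1)[0]                       # first detected face
    shape = predictor(image, face)
    pts = np.array([(shape.part(i).x, shape.part(i).y)
                    for i in range(shape.num_parts)])  # 68 x 2 landmarks
    lips = pts[48:68]                                  # landmarks 49-68 (1-based)
    x0, y0 = lips.min(axis=0)                          # upper left corner
    x1, y1 = lips.max(axis=0)                          # lower right corner
    lip_crop = image[y0:y1, x0:x1]                     # lip image before padding
    return pts, lips, lip_crop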
With this approach we could create 22,245 negative (not speaking) samples and 22,244 positive (speaking) samples, which equals 18.5 h of learning data in total. While the theoretical number of positive samples is far higher, we were aiming for a balanced dataset, and experimental results show that this is sufficient. Table 2 shows the number of samples in the training, validation and test sets. Figure 3 shows a random selection of 10 positive and negative images from the training set. Table 3 shows the dimensionality of one sample for the different features.
We evaluate two important hyper-parameters of
our dataset to examine their relation with learning
performance:
Image Size. The optimal image size is evaluated for MobileNets (Howard et al., 2017) using image sizes from 32 × 32 up to the maximal image size of 200 × 200 with a step size of 32. Figure 4a shows that the maximal accuracy in the spatial domain can be reached using an image size of around 160 × 160.
Number of Frames. Figure 4b shows how accuracy improves over the number of frames for a TimeDistributed MobileNet on 96 × 96 pixel images (limited by available GPU memory). These results show that the VVAD task requires many frames for an accurate prediction and that speaking cannot be inferred from a low number of frames.
Taking Figures 4a and 4b into account, the optimal values for the image size and number of frames are 160 × 160 and 36 respectively.
Dataset Construction. To test the dataset with different models we created the following four features that are available directly in our dataset (a short NumPy check of the resulting per-sample sizes follows the descriptions):
Face Images are used for the most sophisticated model. These face images come in a maximal resolution of 200 × 200 pixels and with a maximal number of 38 frames, so the maximum size of one sample of the Face Images feature is 38 frames × 200 pixels × 200 pixels × 3 channels = 4.56 MB. Pixel values range between 0 and 255, which can be represented with one byte.
Lip Images are also used for an end-to-end learning approach, but they obviously concentrate on a small subset of the Face Images. Lip Images are RGB images with a maximum of 38 frames, but with a maximal resolution of 100 × 50 pixels. This results in 38 frames × 100 pixels × 50 pixels × 3 channels = 0.57 MB.
Face Features are used for the learning approach which focuses on facial landmarks. We provide 68 landmarks with an (x, y) position for a single face, as depicted in Figure 2. A single feature value is stored as float64 (8 bytes), giving 38 frames × 68 features × 2 dimensions × 8 bytes = 41.4 KB per sample.
Lip Features are a small subset of the Face Features that only take the landmarks of the lips into account. dlib's facial landmark detector reserves 20 landmarks for the lips, as shown in Figure 2. This results in 38 frames × 20 features × 2 dimensions × 8 bytes = 12.1 KB for a single sample in the lip features flavor.
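These per-sample sizes follow directly from the shapes and dtypes in Table 3 and can be checked with a few lines of NumPy; the printed byte counts match the sizes above up to rounding.

import numpy as np

# Maximal per-sample shapes and dtypes as listed in Table 3.
features = {
    "Face Images":   ((38, 200, 200, 3), np.uint8),
    "Lip Images":    ((38, 100, 50, 3),  np.uint8),
    "Face Features": ((38, 68, 2),       np.float64),
    "Lip Features":  ((38, 20, 2),       np.float64),
}
for name, (shape, dtype) in features.items():
    nbytes = np.zeros(shape, dtype=dtype).nbytes
    print(f"{name}: shape {shape}, {nbytes} bytes")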
Test Set and Human-Level Accuracy. To test the VVAD-LRS3 dataset, a human accuracy test was performed. The test is built from a randomly seeded subset of 200 samples that is not part of the train/val splits, and we used 10 persons to produce predictions for this set. The overall human accuracy level was 87.93%, while the human accuracy level on positive samples is 91.44% and on negative samples only 84.44%.
(a) Positive Examples
(b) Negative Examples
Figure 3: Random selection of speaking (positive) and not speaking (negative) samples from the VVAD-LRS3 dataset.
(a) Evaluation of the optimal image size with a single frame.
(b) Validation Accuracy for CNN LSTM over the number of
timesteps/frames used.
Figure 4: Comparison of performance as image size and
number of timesteps/frames is varied on MobileNet.
This shows that the automatic extraction of the negative samples is more prone to errors than the automatic extraction of positive samples. This is due to the purpose of the LRS3 as a lipreading dataset, which naturally offers more positive than negative samples for a VVAD dataset.
Figure 5: Sample 6178 is labeled as a negative (not speaking) sample by the automatic transformation from LRS3 to the VVAD dataset. In the human accuracy test, 100% of the subjects classified the sample as a positive (speaking) sample. Beat boxing is not considered speech in the LRS3 dataset, which causes the wrong label.
In the human accuracy test some of the samples
were labeled incorrectly, or at least were classified with the opposite class label. A closer look is taken at four of these samples from the test set. While samples 31366 and 42768 are labeled positive by the automatic transformation from the LRS3 sample to the VVAD sample, they were classified as negative by all the subjects in the human accuracy test. For samples 14679 and 6178 the opposite is the case. On further investigation it was seen that samples 14679 and 42768 are obviously wrongly labeled, while samples 31366 and 6178 have special properties that make them perform very badly in the human accuracy test. Sample 31366 contains a very quick head movement, which makes it very hard to see the very small movements of the mouth that produce speech. Sample 6178 shows a person obviously producing sound with his mouth, but the sound is not speech but beat boxing, which is not considered speech in the original LRS3 dataset. Samples 6178 and 31366 are depicted in Figures 5 and 6 respectively.
Figure 6: Sample 31366 is labeled as a positive (speaking) sample by the automatic transformation from LRS3 to the VVAD dataset. In the human accuracy test, 100% of the subjects classified the sample as a negative (not speaking) sample. The fast movement of the head combined with only a small movement of the lips causes the misclassification by the subjects.
4 INITIAL EXPERIMENTAL
RESULTS
Pre-trained Models. To show that the dataset can be
efficiently used to train a VVAD, we implemented and
trained CNN-LSTM models with our dataset as base-
lines. As described earlier speech cannot be effectively
classified with a single image, which motivates the use
of recurrent neural networks.
We evaluate the use of LSTM cells, as described in
(Hochreiter and Schmidhuber, 1997). We use standard
architectures as a backbone, which are wrapped by a
TimeDistributed wrapper in order to transform them
into a recurrent network that can process a sequence
of images. TimeDistributed is a wrapper provided by Keras (Chollet et al., 2015) which applies the same model to every timestep, to effectively handle time series and sequences. The resulting sequence of features is processed by an LSTM layer to capture the temporal structure, while the last Dense layer is used to make the classification. Experiments have shown that a single Dense layer with 512 units on top of an LSTM layer with 32 units gives good results. We use a 200 × 200 pixel input image size on one or two frames for initial testing.
We use DenseNet (Huang et al., 2018), MobileNet
(Howard et al., 2017) and VGGFace (Parkhi et al.,
2015) as backbone networks in the TimeDistributed
wrapper. These models are pre-trained and used as is
from the keras-applications library.
All models were trained using Stochastic Gradient Descent, with a starting learning rate of α = 0.01, decayed as needed. Models were trained until convergence, which varied between 80 and 200 epochs. A
binary cross-entropy loss is used, and each network
has an output layer with a single neuron and a sigmoid
activation. All architectures and hidden layers use a
ReLU activation.
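The baseline just described can be sketched in a few lines of Keras; this is a minimal illustration, not the exact training code. The use of tf.keras.applications with ImageNet weights and average pooling, as well as the omission of input preprocessing, learning rate decay and callbacks, are assumptions.

# Sketch of the CNN LSTM baseline: a TimeDistributed MobileNet backbone,
# an LSTM with 32 units, a Dense layer with 512 units and a sigmoid output,
# trained with SGD (learning rate 0.01) and binary cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers, models

FRAMES, SIZE = 2, 200                         # two frames at 200 x 200 pixels

backbone = tf.keras.applications.MobileNet(
    include_top=False, weights="imagenet",
    input_shape=(SIZE, SIZE, 3), pooling="avg")

model = models.Sequential([
    layers.Input(shape=(FRAMES, SIZE, SIZE, 3)),
    layers.TimeDistributed(backbone),          # one feature vector per frame
    layers.LSTM(32),                           # temporal aggregation
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # speaking / not speaking
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy", metrics=["accuracy"])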
Our results are presented in Table 4. It shows
that DenseNet, MobileNet and VGGFace improve by
around 2.3% when using one more frame. Our results also show that MobileNetV1 and DenseNet121 perform better than the corresponding model variants. In the following, we refer to MobileNetV1 and DenseNet121 as MobileNet and DenseNet respectively.
End-to-End Learning. In this section we evaluate
end-to-end models trained from scratch, using not just
face images but also other features such as lips and
their features. Since evaluating for all 38 frames is
not always possible (depending on access to GPUs
with large amounts of RAM), only the MobileNet as
the smallest of the base models is taken further into
consideration. For this experiment we use 96 × 96 input image sizes for image features.
In comparison MobileNet contains approximately
4.2 million parameters while DenseNet requires
around double the amount with 8 million parameters
and VGGFace has over 50 million parameters. Knowing this, MobileNet is a good compromise between performance and size, because it can consider more timesteps, which in the end can lead to even higher accuracy.
For the face and lip images a TimeDistributed MobileNet is used, while for the approaches learning on the vector features (facial and lip landmarks) we use a single LSTM layer with 32 units and a single Dense layer with 512 units. The training methodology is the same as for the pre-trained models.
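For the landmark features this corresponds to a very small model; a sketch under the same assumptions as above (the input shape and the flattening of the (x, y) pairs per frame are illustrative choices):

# Sketch of the model for the vector features: LSTM(32) followed by Dense(512)
# and a sigmoid output. Use 20 landmarks instead of 68 for the lip features.
from tensorflow.keras import layers, models

FRAMES, LANDMARKS = 38, 68

landmark_model = models.Sequential([
    layers.Input(shape=(FRAMES, LANDMARKS * 2)),   # flattened (x, y) per frame
    layers.LSTM(32),
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
landmark_model.compile(optimizer="sgd", loss="binary_crossentropy",
                       metrics=["accuracy"])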
Our results are presented in Table 5. It shows that even with the substantially smaller face features a validation accuracy of 89.79% can be reached, which is still higher than the human accuracy level. With this
end-to-end learning approach on face images we were
able to reach a very high accuracy of 92% on the test
set. This is higher than the reported human-level accu-
racy on the same dataset.
One interesting remark from our results is that
learning from image data, even if it is from scratch,
seems to outperform the use of facial or lip features
by approximately 3%, which we believe makes sense
since an image might contain additional information
that the pure facial or lip features do not contain. This
shows the importance of using visual models for this
problem.
Prediction Analysis. The classifications of the samples from the test set can be seen in Figure 7.
Table 4: Validation accuracies for the different baseline models using only one or two frames in full resolution (200 × 200 pixels). The increased accuracy highlights the importance of the temporal domain in VVAD.
Baseline Model 1-Frame Acc 2-Frame Acc
DenseNet201 73.08 % -
DenseNet121 73.17 % 75.34 %
MobileNetV2 67.45 % -
MobileNetV1 69.56 % 72.11%
VGGFace 71.96 % 74.36 %
Figure 7: Visualization of predictions on the test set for
MobileNet trained on face images. Each arrow represents
the prediction confidence, with the first 100 samples being
negative (not speaking), and the remaining 100 samples
being positive (speaking).
The first 100 samples are negative and the last 100 samples are positive, and the arrows show the probability, given by the model, that a sample belongs to the positive class. The red dashed line is the decision boundary on which the model bases its classifications. This visualizes how certain the model is about its predictions. Many predictions are incorrect with a high confidence, indicating overconfidence, which also motivates the use of properly calibrated and Bayesian neural network models (Matin and Valdenegro-Toro, 2020).
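A plot in the style of Figure 7 can be produced with a few lines of matplotlib; the trained model and the ordered test set (first 100 negative, then 100 positive samples) are assumed to be available as model and x_test.

# Sketch: per-sample predicted probability of the positive class over the
# ordered test set, with the 0.5 decision boundary as a dashed red line.
# `model` and `x_test` are assumed to exist (see text).
import matplotlib.pyplot as plt

probs = model.predict(x_test).ravel()          # P(speaking) per test sample
plt.stem(range(len(probs)), probs)             # one "arrow" per sample
plt.axhline(0.5, color="red", linestyle="--", label="decision boundary")
plt.axvline(99.5, color="gray", linestyle=":", label="negative | positive")
plt.xlabel("test sample index")
plt.ylabel("predicted probability of speaking")
plt.legend()
plt.show()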
5 CONCLUSIONS AND FUTURE
WORK
In this work we present the construction of the VVAD-LRS3 dataset using an automated pipeline that builds VVAD samples from LRS3 samples. We also show that these samples are not labeled perfectly, but that they can still be used to learn a robust VVAD system. The
VVAD-LRS3 dataset provides four kinds of features:
facial and lip images, and facial and lip landmark fea-
tures.
We provide baselines on our dataset using pre-trained and end-to-end neural network architectures on all feature kinds.
Table 5: Overview of the validation and test performance of the different features on MobileNet using all 38 frames at 96 × 96 pixels.
Feature Validation Acc Test Acc
Face Images 94.05 % 92.0 %
Lip Images 93.98 % 92.0 %
Face Features 89.79 % 89.0 %
Lip Features 89.93 % 89.0 %
Human Level - 87.93 %
Face images with end-to-end architectures seem to perform best with a validation accuracy of up to 94%, while landmark features of face and lips seem to perform the worst at around 89% validation accuracy. We also show that up to 38 frames are required to obtain the highest predictive performance for this task.
Although the performance is better than human accuracy and the presented solutions seem robust enough to handle outliers, it may be possible to improve the results with a cleaned dataset. The cleaning can be done by manually checking all labels and correcting or removing wrongly labeled samples, or by enhancing the algorithm to reduce the number of wrong labels.
To keep the test results comparable with the human accuracy level, it was only possible to use the 200 randomly seeded samples from the human accuracy test as the test set for the trained models, although it is considered best practice to hold back at least 10% of the data for testing. If the test set were bigger and comparability to the human performance could be ensured, the test results would carry even more weight. A larger number of samples tested on humans would also make it possible to examine the relationship between DNNs for VVAD and the human brain's approach to VVAD more closely. Furthermore, it is hard to determine a ground truth for the data, because human classification varied for some of the samples. In general, however, the human classification and the data created by the automatic pipeline show a significant similarity, which allows us to use the data effectively as is. Experiments with trained models in real human-robot interaction can hopefully be conducted in the future. We hope that the community benefits from our dataset and is able to produce learning algorithms that yield a robust VVAD system for social robots.
The dataset is publicly available under
https://tinyurl.com/mucfmfyx. With a large
scale publicly available dataset for VVAD the
research on this topic can be massively accelerated.
Furthermore we were able to publish some of
the trained models on PyPI (PSF, 2022) under
https://pypi.org/project/vvadlrs3/ to make it easier to
develop applications.
REFERENCES
Aung, Z. H. and Ritthipravat, P. (2016). Robust visual voice
activity detection using long short-term memory recur-
rent neural network. In Revised Selected Papers of the
7th Pacific-Rim Symposium on Image and Video Tech-
nology - Volume 9431, PSIVT 2015, pages 380–391,
New York, NY, USA. Springer-Verlag New York, Inc.
Chollet, F. et al. (2015). Keras. https://keras.io.
F. Luthon, M. L. (1998). Lip motion automatic detection.
gun Choi, J. and Kim, M. (2009). The usage and evalu-
ation of anthropomorphic form in robot design. In
Undisciplined! Design Research Society Conference
2008.
Guy, S., Lathuilière, S., Mesejo, P., and Horaud, R. (2020).
Learning visual voice activity detection with an auto-
matically annotated dataset.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Computation, 9(8):1735–1780.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,
W., Weyand, T., Andreetto, M., and Adam, H. (2017).
Mobilenets: Efficient convolutional neural networks
for mobile vision applications.
Huang, G., Liu, Z., van der Maaten, L., and Weinberger,
K. Q. (2018). Densely connected convolutional net-
works.
Kanda, T. and Ishiguro, H. (2017). Human-Robot Interaction
in Social Robotics. CRC Press.
King, D. E. (2009). Dlib-ml: A machine learning toolkit.
Journal of Machine Learning Research, 10:1755–1758.
Matin, M. and Valdenegro-Toro, M. (2020). Hey Human, If
your Facial Emotions are Uncertain, You Should Use
BNNs! In Women in Computer Vision @ ECCV.
Meriem Bendris, D. C. and Chollet, G. (2010). Lip activity detection for talking faces classification in TV content. 3rd International Conference on Machine Vision (ICMV), pages 187–190.
Miwa, H., Okuchi, T., Itoh, K., Takanobu, H., and Takan-
ishi, A. (2003). A new mental model for humanoid
robots for human friendly communication introduc-
tion of learning system, mood vector and second or-
der equations of emotion. In 2003 IEEE Interna-
tional Conference on Robotics and Automation (Cat.
No.03CH37422), volume 3, pages 3588–3593 vol.3.
Nikitina, A. (2011). Successful Public Speaking. bookboon.
OZTOP, E., FRANKLIN, D. W., CHAMINADE, T., and
CHENG, G. (2005). Human–humanoid interaction: Is
a humanoid robot perceived as a human? International
Journal of Humanoid Robotics, 02(04):537–559.
Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep
face recognition. In British Machine Vision Confer-
ence.
Patrona, F., Iosifidis, A., Tefas, A., Nikolaidis, N., and Pitas,
I. (2016). Visual voice activity detection in the wild.
IEEE Transactions on Multimedia, 18(6):967–977.
Patterson, E. K., Gurbuz, S., Tufekci, Z., and Gowdy, J. N.
(2002). Cuave: A new audio-visual database for multi-
modal human-computer interface research. 2002 IEEE
International Conference on Acoustics, Speech, and
Signal Processing, 2:II–2017–II–2020.
PSF, P. S. F. (2022). The python package index (pypi).
Python package repository.
Spyridon Siatras, N. N. and Pitas, I. (2006). Visual speech
detection using mouth region intensities. 14th Euro-
pean Signal Processing Conference (EUSIPCO 2006),
Florence, Italy, September 4-8, 2006, copyright by
EURASIP.
Triantafyllos Afouras, Joon Son Chung, A. Z. (2018). LRS3-TED: a large-scale dataset for visual speech recognition. arXiv:1809.00496.
Zellner, B. (1994). Pauses and the temporal structure of
speech.
Ángel Pascual del Pobil Ferré, Bou, M. D., Anna Stenzel,
Eris Chinellato, Markus Lappe, and Roman Liepelt
(2013). When humanoid robots become human-like in-
teraction partners: Corepresentation of robotic actions.
page 18.