Novel Lie Speech Classiﬁcation by using Voice Stress

Felipe Mateus Marcolla

, Rafael de Santiago

2 a

and Rudimar Lu

ıs Scaranto Dazzi

Escola do Mar, Ci

encia e Tecnologia, Universidade do Vale do Itaja

ı, Itaja

ı, Brazil

Departamento de Inform

atica e Estat

ıstica, Universidade Federal de Santa Catarina, Florian

opolis, Brazil

Keywords:

Voice Stress Analysis, Neural Network, Lie Detection.

Abstract:

Lie detection is an open problem. Many types of research seek to develop an efﬁcient and reliable method to

solve this problem successfully. Among the methods used for this task, the polygraph, voice stress analysis,

and pupil dilation analysis can be highlighted. This work aims to implement a neural network to perform the

analysis of a person’s voice and to classify his speech as reliable or not. In order to reach the objectives, a

recurrent neural network of LSTM architecture was implemented, based on an architecture already applied in

other works, and through the variation of parameters, different results were found in the tests. A database with

audio recordings was generated to perform the neural network training, from an interview with a randomly

selected group. Considering all the neural network base models implemented, the one that showed prominence

presented a precision of 72.5% of the data samples. For the type of problem in focus, which is voice stress

analysis, the result is statistically signiﬁcant and denotes that it is possible to ﬁnd patterns in the voice of

people who are under stress.

1 INTRODUCTION

John Larson created the ﬁrst polygraph in 1921. It is

an equipment that detects if a person is lying. It reads

physiological disturbs of the body of the target per-

son. A polygraph supposes that telling a lie provokes

stress, and it can be read (Council, 2003).

Regular polygraph equipment has a paper ribbon

in which the signals captured from the target are writ-

ten. The signals can represent respiratory frequency,

heart frequency, blood pressure, sweating. Some ver-

sions of polygraphs read the movement of arms and

legs. An interpreter read the results on the ribbon to

judge if the answers of the target are reliable (Ofﬁce

of Technology Assessment’s, 1983).

Generally, a polygraph is not used to decide if a

person is telling a lie in serious places because there

is no conclusive proof that it does not fail. It also

can demand hours to ﬁnish a test. Polygraphs cannot

be used on video and audio resources because they re-

quire the presence of the target to measure his signals.

Some software versions of polygraphs have been pro-

posed to deal with these problems. Some versions use

the voice stress analysis (Damphousse, 2009).

Voice stress analysis detects stress or threat in a

subject. His body reacts, and his muscles are ready

https://orcid.org/0000-0001-7033-125X

to get in action. These preparations also affect the

voice because of the tension in the respiratory system

and tissues. For this reason, the voice can be used to

detect stress (Liu, 2004).

Softwares called VSA (Voice Stress Analysis) has

the objective to measure the disturbs from the voice

pattern of a subject. They can be caused by physical

stress that is triggered when lying. The software in-

terprets the disturbs of the voice and give the result

(Damphousse, 2009).

The work reported in this paper uses the voice

analysis of individuals to detect stress levels and clas-

sify the spoken information into two states: truth

or lie. For doing this, Long Short-Term Memory

(LSTM) neural networks were speciﬁed, developed,

and trained. The results show interesting levels of ac-

curacy.

Given the context, the research problem can be

simpliﬁed to a single question: “Would it be possible

to detect a lie in an individual’s speech by analyzing

stress in his voice during his speech using a neural

network? If so, how signiﬁcant can the results be?”

742

Marcolla, F., de Santiago, R. and Dazzi, R.

Novel Lie Speech Classiﬁcation by using Voice Stress.

DOI: 10.5220/0009038707420749

In Proceedings of the 12th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2020) - Volume 2, pages 742-749

ISBN: 978-989-758-395-7; ISSN: 2184-433X

2 BACKGROUND

In this section, the background related to the research

context is reported. Voice concepts are treated in

section “Pitch, Jitter and Voice Stress”. The neural

networks selected to develop our experiments are re-

ported in section “Long-Short Term Memory Neural

Networks”. Finally, some related works are presented

and discussed.

2.1 Pitch, Jitter and Voice Stress

The fundamental frequency of the voice, or f 0 as it is

also known, is a property of sound. It is the smallest

periodic component resulting from vocal fold vibra-

tions. In voice properties, f 0 can indicate the pitch of

a sound, which makes it possible to classify it as high

or low. It can also indicate the loudness of the sound

and can determine if a sound is loud or weak. Pitch is

the auditory perception that gives the sensation of the

pitch of the sound, making the listeners realize if the

sound is low or high (Kremer and Gomes, 2014).

Jitter, or micro-tremors, are involuntary voice

changes being determined by involuntary fundamen-

tal frequency changes over a short period. That is,

jitter is a disturbance or oscillation of the pitch of the

voice (Teixeira et al., 2011).

The concepts of voice stress analysis originated

from the fact that when a person is under a fearful

situation, the body prepares for the ﬁght, which in-

creases the defense intent of some muscles. These

changes can affect muscle tension and speech organs,

such as breathing. Therefore, it can be possible to ver-

ify if a person is stressed, and this stress can be caused

because of false answer from the subject (a lie), just

by analyzing their voice (Liu, 2004).

2.2 Mel-Frequency Cepstral

Coefﬁcients (MFCC)

The MFCC method was ﬁrst mentioned by Bridle and

Brown in 1974, and later in more detail by Davis and

Mermelstein in 1980. To perform a feature set extrac-

tion with all the information present in a voice signal,

the technique MFCC uses the “mel-scale” to analyze

the distinct characteristics present in the spectrum.

Mel is a unit of measure for frequency or peaks

perceived by the human ear in a tone, and the mel-

scale came up to map this frequency. This scale seeks

to approximate the sensitivity characteristics of the

human ear, as it has been analyzed that a linear scale

does not represent the human perception of pure tone

frequencies of voice signals. For a tone with a fre-

quency f, measured in Hz, a subjective tone, measured

on a mel-scale, is deﬁned (Cardoso, 2009).

The scepter can be characterized as the spectrum

of a spectrum. Mel cepstral coefﬁcients (MFCC) can

be deﬁned as coefﬁcients derived from a type of cep-

stral representation of the signal. For this purpose, a

logarithmic scale is used, to transform the frequency

scale to give less emphasis to the high frequencies,

thus bringing the model closer to the perception of

functioning of the human ear, because the frequencies

are perceived by it non- linear (Tiwari, 2009).

Substantially it is possible to deﬁne the cepstrum

of a signal as a transformation over the signal spec-

trum, which induces two chain operations (Childers

et al., 1977). A cepstral mobile element can be deter-

mined as the cepstral power of a mobile scale audio

frequency range. To perform the cepstral calculation,

the cepstral mel-scale elements of each frame, from

point to point, from a spectrum interval resulting from

the application of a frequency centered ﬁlter on the

mel-scale, by means of the Fourier transform module.

Subsequently, the ﬁltered spectrum logarithm and the

type 2 discrete cosine transform are calculated (Rao

and Yip, 1990).

2.3 Long-Short Term Memory Neural

Networks

LSTM is a recurring neural network architecture. It

was introduced by Hochreiter and Schmidhuber in

1997 to minimize the vanishing gradient problem,

which occurs when the network has no memory of

what happened in previous steps; thus it cannot prop-

agate dependencies through the entire data sequence

(Hochreiter et al., 2001).

LSTM networks arose from the idea of the archi-

tecture of typical neural networks composed of neu-

rons. However, the concept of memory cell was in-

troduced to represent their processing units, instead

of neurons. This cell can hold a value for a short or

a long time as an input function of its cells, which

enables the cell to remember important information,

not just the current value being computed. A memory

block consists of one or more cells, which compute

inputs and outputs for all cells in that block (Hochre-

iter and Schmidhuber, 1997).

Within the block unit, three gates control the ﬂow

of cell input and output information (see Figure 1).

There is the input gate, which controls the input of

new information into the cell. The forget gate, which

is responsible for choosing which information is not

pertinent and should be forgotten to enable new in-

formation to be remembered. Finally, the output gate

keeps track of the information coming out of the cell.

Novel Lie Speech Classiﬁcation by using Voice Stress

743

There is an activation function that is responsible for

determining when each of the gates should let the in-

formation ﬂow or not (Hochreiter and Schmidhuber,

1997).

x x

Cell

Input

Output

Input

gate

Forget

gate

Output

gate

Figure 1: A memory cell from a LSTM neural network.

2.4 Related Works

Some recent papers relate works in the context of au-

tomatic lie detection. We have chosen four of the pa-

pers found in the literature. These works helped us to

specify a neural network basic-model for our experi-

ments and analyses.

The work of Liu (2004) (Liu, 2004) has a purpose

similar to the present work. It aims to detect voice

stress levels through pitch and jitter, performing the

implementation of Bayesian hypothesis testing with

Matlab. For this work, the best result obtained was

87% accuracy. However, it was only possible to

achieve this result with a dependent speaker, that is,

the model behaved this way only in the voice analysis

of a single person, and using the Pitch voice feature.

For the multi-speaker test (independent speaker), the

accuracy value drops to 70% using the Pitch charac-

teristic.

The work of Nurc¸in et al. (2017) (Nurc¸in et al.,

2017) aims to detect lies, which is performed by

studying and analyzing the dilation of the pupil of

the human eye, using a neural network with the back-

propagation algorithm. The neural network was im-

plemented and fed with the 60 preprocessed samples.

Networks with various amounts of hidden layers were

tested: 2, 7, 10, 20, and 50. The one with the best re-

sults was the 10 layer network. During training, all

sample images were correctly classiﬁed by the net-

work. With these results, it was found, although not

tested with new images, that it is possible to clas-

sify images of pupils according to their dilation ac-

curately.

Chow and Louie (2017) (Chow and Louie, 2017)

have the objective of detecting a person’s lie using

speech processing and natural language through au-

dio recorded from the subject. An LSTM neural net-

work was used to accomplish this goal. The ﬁnal

recurrent neural network model used was a single-

level, one-way drop-out Long-Short Term Memory

(LSTM). The results with this type of implementa-

tion were around 61 % to 63 % accuracy. The voice

characteristic used for the analysis was the MFCC.

In the work of Dede et al. (2010) (Dede and Sa-

zli, 2010), the objective is to perform isolated speech

recognition, employing the use of neural networks,

where the recognition of digits 0 to 9 uttered by a

speaker is performed. The neural network used is

a recurrent Elman type. In the tests performed, the

speech recognition system developed in this project

recognized the digits very accurately, for Elman net-

works a hit rate of 99.35 % was reached, in PNN net-

works the hit frequency was 100 %, and the MLP net-

work reached 98.75 %. The results obtained in this

study were very satisfactory and determine that Arti-

ﬁcial Neural Networks are an adequate and effective

way to perform speech recognition.

3 METHODS

The hypothetical/deductive method was adopted in

our research since the objective is to prove their sus-

tainability of the following supposition: “It is possible

to detect a lie by analyzing voice stress using a neural

network.”

A literature search was conducted to ﬁnd works

about automatic lie detection through some analysis

of the human body to understand more about how

and what are the most common methods of lie detec-

tion. Then a neural network basic-model was speci-

ﬁed. From the speciﬁed basic-model, the neural net-

work was implemented. The network implemented

in the search for the most useful parameters was an-

alyzed. Finally, the results obtained were evaluated

and discussed.

The present research starts from the study of

speciﬁc-domains of knowledge and concerning the

problem in question, that is still open, without a

deﬁnitive solution.

In the following subsections, it is presented the

data used during the experiments, the preprocessing

performed, the classiﬁcation model and experiments

performed are formalized.

ICAART 2020 - 12th International Conference on Agents and Artiﬁcial Intelligence

744

3.1 Data

One of the crucial stages of this work is related to the

samples to build the corpus. Our experiments involve

the speech of people, so the corpus must contain la-

beled audio ﬁles related to the answers of subjects.

The ﬁrst step to record the audio samples was to

create an interview model, in which one person asks a

question and the subject answer it. The aim is to cap-

ture the subject’s answer. A questionnaire was created

to be followed as an interview script, consisting of 11

questions. After each question asked to the subject,

he answers them, thus creating a dialogue. The inter-

view is conducted twice with each, the ﬁrst time the

subject answers the questions telling lies, and the sec-

ond time the subject answers speaking the truth. So,

we have samples labeled as true or false (lie) answers.

The interview was conducted with 10 male subjects,

in the Brazilian Portuguese language, members of the

Applied Intelligence Laboratory (LIA) of the “Uni-

versidade do Vale do Itaja

ı”, and members of a group

outside the university. Figure 2 shows the amplitude

of a person’s voice from an audio ﬁle of the corpus.

The x-axis represents time, and the y-axis represents

frequency amplitudes.

Figure 2: Graphical voice amplitude representation.

3.2 Preprocessing

During the preprocessing, it was used the software

Audacity, a free audio recording and editing software.

As each round of questions was recorded continu-

ously, without intervals, it was necessary to use Au-

dacity, to divide each interview into several separate

ﬁles, aiming only to contain the answers of the subject

because only these data compose the corpus. After

that, each answer was stored in a single ﬁle. Silence

present at the beginning and end of the ﬁles were re-

moved. Finally, the ﬁles are labeled as true or false

answers.

After completing the editing of all the interviews,

they resulted in a total of 220 audio ﬁles, 110 lying

answers, and 110 telling the truth, with a balanced

numerical proportion to represent each class of the

problem. Of the 220 ﬁles, 180 were assigned to be

used in network training, and the other 40 answers

were separated for classiﬁcation tests. The ﬁles to

perform the classiﬁcation tests were not used in neural

network training. The choice of the samples that was

used in the training and test occurred as follows: from

each interviewed, 9 ﬁles were randomly assigned to

be used in training, and 2 ﬁles designated for testing,

keeping the same proportion for all the subjects. The

duration of each audio ﬁle is variable. It can be 1 sec-

ond or even 7 seconds. Figure 3 shows a graph repre-

senting the MFCC spectrogram that corresponds to an

audio ﬁle extracted from the corpus, where the x-axis

indicates the time, the y-axis indicates the amount of

MFCC, and the colors represent the intensity of the

spectral density of energy present at each frequency

of the sound.

Figure 3: MFCC Espectrogram.

3.3 Input Processing and Feature

Extraction

The python language has a library for music and au-

dio analysis, called Librosa. In the ﬁrst step of the al-

gorithm implementation, the library was used to load

the audio ﬁles from inside the corpus directory. The

audios were loaded by the library with a sample rate

of 22050 Hz, which is the number of samples per sec-

ond. After uploading all database ﬁles, the next step

is to extract the MFCC characteristics from each of

them.

For each uploaded ﬁle, another Librosa library

function is applied, which is a function responsible

for extracting the MFCC characteristics from them.

The amount of MFCC extracted in each audio ﬁle is

predeﬁned as a parameter passed to the function, ini-

tially this value was set to 13, then to 20 and later to

40. Thus obtaining three different datasets results in

the time of extraction and allowing more signiﬁcant

variation of the scenario in the training and testing of

the neural network. For each audio ﬁle, a data matrix

is generated, which is the size relative to the length

of the recorded audio, and the number of MFFC ex-

tracted.

As the extracted audio sequences vary in size ac-

cording to the length of time the recording has, it is

Novel Lie Speech Classiﬁcation by using Voice Stress

745

not possible to infer these variable-length sequences

directly from the neural network. A sequence size

normalization technique called padding is used in this

project to solve this situation. The padding process

proceeds as follows when the MFCC audio sequences

are generated, it is veriﬁed which one has the largest

size, and based on it, all other smaller sequences are

remodeled to have the same size, but ﬁlling empty

spaces with the value 0. An auxiliary vector is cre-

ated, and it is assigned the actual size that each au-

dio sequence has. This is of great importance, as this

information is passed on to the neural network, and

then it will have a reference to how far it can pro-

cess each sequence, and what values it can disregard,

which in this case will be from the actual size, local

to the which padding was performed.

After extracting the voice characteristics of the

corpus, and normalizing them all, the next step is to

label the inputs, that is, to determine which class it be-

longs. At the moment each audio ﬁle has the extracted

MFCC characteristics, there is an auxiliary vector that

stores the information about which category that input

belongs to (being, 0 = Lie and 1 = Truth). The one-hot

encoding technique was used to transform this cate-

gorical data into numerical values. In this way, the

integer values representing classes 0 and 1 are trans-

formed into a vector in which the ﬁrst position is for

the lie category, and the second position is for the

truth category. To say that a given audio input has

a category, place the value bit 1 at the position of the

vector representing that category, and the rest of the

positions must be ﬁlled with the value bit 0, always

considering the fact that each sample must belong to

only one class.

When the feature extraction, sequence normaliza-

tion, and class coding step is completed, the result is

a three-dimensional matrix with the MFCC character-

istics of the interviewees’ voice. Before starting train-

ing, this data is shufﬂed, resulting in the data set ready

to be inferred from the neural network, in which the

characteristics are analyzed, and the network training

is performed.

3.4 Classiﬁer Model

The task of the neural network for this work is clas-

siﬁcation, and the type of learning employed in the

neural network is supervised. For this reason, there

is a training input dataset with related output. These

are the predeﬁned categories for that data. During the

training, the model will be adjusted to map the inputs

to their corresponding outputs, ﬁnding patterns that

correspond to this association between the data and

the label.

The main processing core of the entire project is

deﬁned as the prediction model, which essentially

consists of a learning algorithm and training data.

Within this prediction model, the neural network is

instantiated, and through a set of parameters is de-

ﬁned the number of layers, the number of neurons per

layer, the calculation of weights and bias, and the ac-

tivation function. In the model, training data is in-

ferred, and the loss function, which is responsible for

performing each training iteration, calculates and ver-

iﬁes how close the network has come to mapping in-

puts according to correct outputs. With each error en-

countered, the learning algorithm will update and ad-

just the weights of neural network neurons to ﬁnd the

best training results.

The loss function used in this work is “cross-

entropy” for “softmax”, which can be accessed using

a TensorFlow API (Application Programming Inter-

face) method. This type of loss function is most often

used to quantify the difference between two probabil-

ity distributions, in which an output vector represent-

ing the probability distribution is returned.

The activation function used in the neural net-

work of this work is “softmax”, which is very efﬁcient

when the problem is classiﬁcation of data.

To perform the training, with the optimization al-

gorithm Adam Optimizer was used, which is an algo-

rithm that can be used replacing the classic (relatively

slow) stochastic gradient descent procedure to update

the weights of the neural network in an iterative way,

from the training data.

The dimensions of the input layer of the neu-

ral network are variable to the size of the presented

dataset. It varies according to the size of the sample

batch, the number of MFCC per sample, and the size

of the sequences for each sample.

The number of nodes in the output layer is propor-

tional to the number of classes deﬁned for the input

network samples, and the number of samples that are

inferred from the neural network. Thus, a probability

distribution is generated for the classes deﬁned for the

network.

All prediction model hyperparameters are ﬁrst as-

signed a value to start experiments. They are ad-

justable values and set before the start of network

training. The following parameters are deﬁned: learn-

ing rate, the number of hidden layers, the number of

nodes per hidden layer; batch size; the number of it-

erations; and MFCC number per sequence. Figure

4 shows a diagram representing the prediction model

algorithm steps.

ICAART 2020 - 12th International Conference on Agents and Artiﬁcial Intelligence

746

Figure 4: Fluxogram of Algorithm.

3.5 Experiments

After the parameter deﬁnition, the algorithm was ex-

ecuted. The neural network graph is initialized, and

the training step is started. At each time, the data

is inferred from the neural network prediction model,

and the process is repeated until the required number

of iterations is reached. At the end of each iteration,

the loss function is calculated, which will represent

how close that training step is to the desired results,

and as the result of the calculated error, the optimiza-

tion algorithm performs the adjustment of the network

weights and biases, search to decrease the error value,

and the algorithm proceeds to the next iteration of the

training.

When all training iterations are completed, the

prediction model receives the test samples, and then

the number of samples that have been correctly clas-

siﬁed is calculated, thus obtaining the accuracy of this

model. Analyzing the accuracy result, the closer to 1

the value, the greater the number of samples that were

correctly classiﬁed; the closer to 0, the greater the

number of errors during the classiﬁcation of the test

samples. Figure 5 shows a graph of the loss function

value according to the iteration number. It is possible

to see the value of the loss function approaching zero

when the number of iterations increases.

4 RESULTS

Only the most effective results were selected among

all sets of tested parameters and models. The results

have signiﬁcance and precision above a chance. Ta-

ble 1 lists the prediction models with their respective

hyperparameters, and the accuracy reached by each

of them. It is necessary to deﬁne the hyperparam-

Figure 5: Loss Graph.

eters that are responsible for conﬁguring the neural

network topology and the training process to perform

the training. The column “Model No.” that is present

in Table 1, is only to facilitate the identiﬁcation of the

model when it is quoted later in the text.

Several parameter settings are combined to ensure

good performance in training and testing. The “Lay-

ers” column represents the number of hidden layers

of the neural network that are responsible for all input

data processing. The “Cells” column is the number

of LSTM cells in each hidden layer of the network,

each unit of cell performs the processing of a charac-

teristic, and directs its processing output to the next

layer. The column “No.MFCC” refers to the amount

of MFCCs characteristics present in each audio ﬁle of

the data set used in neural network training and test-

ing. Regarding “Batch”, the training dataset is made

up of a total of 180 entries with MFCC character-

istics, but this set of entries can be broken up into

smaller batches for training, which are called mini-

batches. Because the entries have been fragmented

into smaller batches, each iteration goes through only

one batch, and the training season will only be re-

alized when all the mini-batches in the set are fully

Novel Lie Speech Classiﬁcation by using Voice Stress

747

run. “Iterations” corresponds to each training step

performed, which is completed when a batch is taken.

“Learn. Rate” refers to the size of the change made

by the learning algorithm in neural network weights.

Thus, the higher this value, the greater will be the

changes in neural network weights during training.

“Accuracy” is the percentage of correct classiﬁcations

in the test set that the neural network model obtained

after performing the training.

Model 1 presented an accuracy of 72.5% and was

the highest reached in this work. This model reached

a total of 29 samples correctly classiﬁed, 11 sam-

ples classiﬁed wrong, totaling the 40 tested samples.

Among these errors, 6 are false positives, and 5 are

false negatives.

Model number 2 brought very relevant results, as

described in Table 1 presented an accuracy of 70.0%.

This model reached a total of 28 correctly classiﬁed

samples, 12 wrong classiﬁed samples, totaling 40 test

samples. Of these errors, 5 are false positives, and 7

are false negatives.

Following Table 1, model number 3 presented an

accuracy of 67.5%. This model reached a total of 27

correctly classiﬁed samples, 13 wrong classiﬁed sam-

ples, totaling 40 test samples. Of these errors, 6 are

false positives, and 7 are false negatives.

The next model listed is number 4 which showed

an accuracy of 65.0% as described in Table 1. This

model reached a total of 26 correctly classiﬁed sam-

ples, 14 wrongly classiﬁed samples, totaling 40 test

samples. Of these errors, 5 are false positives, and 9

are false negatives.

Model number 5, as described in Table 1, pre-

sented an accuracy of 55.0%. Among all the models

mentioned, this was the one that reached the lowest

result. This model reached only a total of 22 correctly

classiﬁed samples, 18 wrongly classiﬁed samples, to-

taling 40 test samples. Of these errors, 9 are false

positives, and 9 are false negatives. This model is the

only one in the results presented, which has the value

of 40 of MFCC. For this parameter setting, despite

the low result, it was the model that brought the most

signiﬁcant result.

It may be possible to perceive that there are simi-

larities in the parameters of each model. For the layer

quantity parameter, values were only between 3 and

4, and when using values smaller or larger than this,

we have noticed that there is no better result. The

number of cells in the hidden layers was between 200

and 300. Out of this range, there were no results with

any signiﬁcance. The ﬁle batch size was deﬁned on

all models with a value of 64, which brought stabil-

ity to the model when it was trained several times,

always producing equivalent and more stable results.

The number of iterations was between 100 and 180,

and it was observed that when the number of itera-

tions was higher than 200, it produced an overﬁtting

behavior in the model. This means that the model pre-

sented good classiﬁcation results during training, but

for a set of new data, which is the data for testing, the

model demonstrates inefﬁciency. The learning rate

showed variable values according to each model. It

can be observed in a considerably high-value range,

varying from 0.001 to 0.01. One of the parameters

that greatly inﬂuenced the results was the number of

MFCCs, in which models deﬁned with values of 13

and 20, proved to be much more effective in the re-

sults than the model that had 40 MFCC. The high-

est accuracy obtained by a model with 40 MFCC was

55.0%, which is an inaccurate result and does not rep-

resent that the model is an effective classiﬁer.

5 CONCLUSIONS AND FINAL

REMARKS

The work-related in this paper aims to analyze an

LSTM performance when classifying a person’s voice

answer as reliable or not. For that, it was necessary

to verify which prediction models implemented led

the best results by checking its accuracy. By verify-

ing directly, the model that has the highest number of

correct ratings during the tests demonstrates the most

efﬁcient in the results. In the ﬁnal stage of this work,

the most relevant results were listed, and among them,

the one that showed the most prominence is a model

that reached an accuracy of 72.5%. That is, 72.5%

of the test samples were classiﬁed correctly. It is still

possible to state that there was a relevant statistical

signiﬁcance since the model behaved above the level

of chance. For the context of this study, which is lie

detection through the analysis of voice answers, the

results obtained in this project can be considered rel-

evant, since there is no such result in the literature.

It can be observed that the obtained results are

close to other similar works. Similar work using sim-

ilar technology with the experiment performed in this

project is the work of (Chow and Louie, 2017), where

the purpose is to ﬁnd patterns of lying in the voice

(through the MFCC) and in the sample interview tran-

scripts, using as a prediction model a recurrent LSTM

neural network. Following these speciﬁcations, the

best result obtained in the work of (Chow and Louie,

2017) is 63% of accuracy, slightly lower than that

achieved in this work. One detail is that to perform

the work of CHOW and LOUIE, was used a dataset

ready for the type of problem to be solved, and with

a large number of samples, which can signiﬁcantly

ICAART 2020 - 12th International Conference on Agents and Artiﬁcial Intelligence

748

Table 1: Results of each tested basic-model and its accuracy.

Model Layers Cells N

MFCC Batch Iterations Learn. Rate Accuracy (%)

1 3 300 13 64 150 0.01 72.5

2 4 300 20 64 100 0.003 70.0

3 3 300 13 64 80 0.003 67.5

4 4 200 20 64 100 0.003 65.0

5 3 300 40 64 180 0.001 55.0

facilitate the elaboration of other steps of the work,

and signiﬁcantly increase results according to data set

size.

Another work with similar purpose and that can

be used as a comparison parameter is that of (Liu,

2004), which aimed to detect stress levels in the voice,

through pitch and jitter, performing the implementa-

tion of Bayesian hypothesis with MatLab. The re-

sult of this work reached an accuracy of 87%; how-

ever, this result was only possible on test occasions

in which the trained classiﬁer system was dependent

on the individual speaker, which corresponds to the

fact that the system was trained and tested only with

the voice of a single person. For tests performed in

which the system was trained with the voice of sev-

eral speakers, the best accuracy achieved was 70%,

which represents a value slightly lower than the re-

sult found in this work, which was 72.5%. It is note-

worthy that the (Liu, 2004) dataset was built by him-

self, through audio recordings that occurred during a

game, in which the participants’ goal was to lie to win.

Regarding the problems presented at the begin-

ning of this paper, the results suggest it is possible

to state that there is the possibility of detecting lies

through the speech of the individual using a neural

network. Since lie detection is an open problem, the

main contribution of this paper was to relate the stress

level of the voice as an inducer to detect lies. The use

of Deep Learning to perform this procedure, being un-

precedented, characterizes a contribution to the state-

of-the-art. As future works, we suggest the use of

more massive voice databases than used in this work,

including a diversity of accents and languages.

ACKNOWLEDGEMENTS

This work is partly supported by the “Universidade

do Vale do Itaja

ı” and “Universidade Federal de Santa

Catarina”.

REFERENCES

Cardoso, D. P. (2009). Identiﬁcac¸

ao de locutor usando mod-

elos de misturas de gaussianas. Master’s thesis, Escola

Polit

ecnica, Universidade de S

ao Paulo, S

ao Paulo.

Childers, D. G., Skinner, D. P., and Kemerait, R. C. (1977).

The cepstrum: A guide to processing. Proceedings of

the IEEE, 65(10):1428–1443.

Chow, A. and Louie, J. (2017). Detecting lies via speech

patterns.

Council, N. R. (2003). The Polygraph and Lie Detection.

The National Academies Press, Washington, DC.

Damphousse, K. (2009). Voice stress analysis: Only 15

percent of lies about drug use detected in ﬁeld test.

NIJ Journal, 259.

Dede, G. and Sazli, M. (2010). Speech recognition with

artiﬁcial neural networks. Digital Signal Processing,

20:763–768.

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J.

(2001). Gradient ﬂow in recurrent nets: the difﬁculty

of learning long-term dependencies. In Kremer, S. C.

and Kolen, J. F., editors, A Field Guide to Dynamical

Recurrent Neural Networks. IEEE Press.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term

memory. Neural computation, 9(8):1735–1780.

Kremer, R. L. and Gomes, M. L. d. C. (2014). A eﬁci

encia

do disfarce em vozes femininas: uma an

alise da

frequ

encia fundamental. ReVel, 12(23):28–43.

Liu, X. F. (2004). Voice stress analysis: Detecion of decep-

tion. Master’s thesis, Department of Computer Sci-

ence – The University of Shefﬁeld.

Nurc¸in, F., Imanov, E., Is¸ın, A., and Uzun Ozsahin, D.

(2017). Lie detection on pupil size by back propa-

gation neural network. Procedia Computer Science,

120:417–421.

Ofﬁce of Technology Assessment’s (1983). Scientiﬁc valid-

ity of polygraph testing: a research review and evalu-

ation. Technical report, U.S. Congress.

Rao, K. R. and Yip, P. (1990). Discrete Cosine Trans-

form: Algorithms, Advantages, Applications. Aca-

demic Press Professional, Inc., San Diego, CA, USA.

Teixeira, J., Ferreira, D., and Carneiro, S. (2011). An

alise

ustica vocal - determinac¸

ao do jitter e shimmer para

diagn

ostico de patalogias da fala.

Tiwari, V. (2009). Mfcc and its applications in speaker

recognition. International Journal on Emerging Tech-

nologies.

Novel Lie Speech Classiﬁcation by using Voice Stress

749