Mobile Phone Identification from Recorded Speech Signals Using
Non-Speech Segments and Universal Background Model Adaptation
Dimitrios Kritsiolis (https://orcid.org/0009-0009-3845-5928) and Constantine Kotropoulos (https://orcid.org/0000-0001-9939-7930)
Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
Keywords:
Mobile Phone Identification, Digital Speech Processing, Audio Forensics, Voice Activity Detection,
Universal Background Model.
Abstract:
Mobile phone identification from recorded speech signals is an audio forensic task that aims to establish the
authenticity of a speech recording. The typical methodology to address this problem is to extract features
from the entire signal, model the distribution of the features of each phone, and then perform classification
on the testing data. Here, we demonstrate that extracting features from non-speech segments or extracting
features from the entire recording and modeling them using a Universal Background Model (UBM) of speech
improves classification accuracy. The paper’s contribution is in the disclosure of experimental results on
two benchmark datasets, the MOBIPHONE and the CCNU Mobile datasets, demonstrating that non-speech
features and UBM modeling yield higher classification accuracy even under noisy recording conditions and
amplified speaker variability.
1 INTRODUCTION
Mobile phone identification from recorded speech
signals aims to establish the authenticity of a speech
recording. It is not uncommon for speech record-
ings from mobile phones to be presented as evidence
in criminal investigations or other legal proceedings.
However, for the recording to be admitted as evidence
in a legal case, the court must be convinced of its au-
thenticity and integrity (Maher, 2009).
Here, we examine the problem of brand and model
identification of mobile phones from recorded speech
signals. The term brand refers to the mobile phone
manufacturer, e.g. LG, and the term model refers to a
specific product from a manufacturer, e.g. LG L3.
Mobile phone identification from recorded speech signals can be described as a three-stage process. In the
first stage, features capable of capturing the intrinsic
trace each device leaves on the speech signal are ex-
tracted. In the second stage, the extracted features
are used to train models that represent the distribu-
tion of features of each mobile phone. In the third
stage, classification is performed using the trained
models and the extracted features. The features to
be examined are the Mel Frequency Cepstral Coef-
ficients (MFCCs) (Davis and Mermelstein, 1980) ex-
tracted on a frame basis. The extracted feature vec-
tors are then used to train Gaussian Mixture Models
(GMMs) using the Expectation Maximization (EM)
algorithm (Dempster et al., 1977). The GMMs can
be used for Maximum Likelihood (ML) classification
or the extraction of fixed dimensional feature vectors,
known as Gaussian Supervectors (GSVs) (Garcia-
Romero and Espy-Wilson, 2010). The GSVs are ob-
tained by concatenating the mean vectors (and option-
ally the main diagonals of the covariance matrices) of
the GMMs. They are used in Support Vector Machine
(SVM) classification.
We will explore two different approaches to mo-
bile phone identification. The first approach is to
extract the features from the non-speech segments
of the signal since the speech parts might introduce
speaker-related noise into the device’s intrinsic trace.
The second approach stems from the fact that mo-
bile phone identification based on recorded speech
signals is similar to the speaker recognition prob-
lem. Thus, approaches used in the latter field can
be applied in the former. One of these approaches
is the Universal Background Model (UBM) of speech
(Reynolds et al., 2000), which is adapted to each mobile phone's feature vectors by means of Maximum A Posteriori (MAP) adaptation. GSVs are then
extracted from the UBM-adapted GMMs and are used
in the classification stage.
The paper's contribution is the disclosure of experimental evidence using ML, SVM, and neural classification on two benchmark datasets, namely the MOBIPHONE dataset and the CCNU Mobile dataset.
The empirical results demonstrate that both the non-
speech and the UBM approaches yield higher accu-
racy than the traditional approach of extracting fea-
tures from the entire speech signal and modeling them
with GMMs trained via the EM algorithm, even un-
der noisy recording conditions and amplified speaker
variability.
2 RELATED WORK
(Hanilci et al., 2011) proposed a mobile phone iden-
tification system using MFCCs, Vector Quantization
classifiers, and SVMs. (Hanilçi and Kinnunen, 2014)
extended their previous work by showing that MFCCs
obtained from the non-speech segments yield higher
mutual information than the MFCCs obtained from
the speech segments or the entire speech recording.
(Kotropoulos and Samaras, 2014) used MFCCs to
train GMMs and to obtain GSVs, which were used
with SVMs, a Radial Basis Function neural network,
and a Multilayer Perceptron. To test their methods,
they collected the MOBIPHONE dataset. (Kotropou-
los, 2014) obtained sketches of features from large-
size raw feature vectors and used them with a Sparse
Representation Classifier and SVMs on the Lincoln-
Labs Handset Database (Reynolds, 1997) and on the
MOBIPHONE dataset.
(Baldini and Amerini, 2019) proposed a mo-
bile phone identification approach by stimulating the
phones’ microphones with non-voice sounds. They
extracted the frequency spectrum magnitude and used
it as the input to a Convolutional Neural Network
(CNN), which performed the identification task. (Bal-
dini and Amerini, 2022) continued their work by ob-
taining spectral entropy features based on Shannon
entropy and Renyi entropy to implement dimension-
ality reduction of the spectral representation of audio
signals. These features were then the input to their
previous CNN.
(Giganti et al., 2022) proposed identifying mobile
phones from their microphone under noisy conditions
by applying convolutional neural network-based de-
noising on the signal’s spectrogram.
(Berdich et al., 2022) explored mobile phone mi-
crophone fingerprinting based on human speech, en-
vironmental sounds, and several live recordings per-
formed outdoors, using supervised machine learning
methods.
(Zeng et al., 2020) proposed an end-to-end deep
source recording device identification system for web
media forensics utilizing the MFCCs, the GSVs, and
the i-vectors. (Zeng et al., 2023) expanded their work
by proposing a spatio-temporal representation learn-
ing framework utilizing the MFCCs and the GSVs.
(Berdich et al., 2023) surveyed smartphone finger-
printing technologies based on sensor characteristics.
(Qamhan et al., 2023) proposed a transformer net-
work for authenticating the source microphone. The
Audio Forensic Dataset for Digital Multimedia Foren-
sics (Khan et al., 2018) and the King Saud University
speech database (Alsulaiman et al., 2013) were used.
(Leonzio et al., 2023) addressed audio splicing de-
tection and localization by using the CNN in (Baldini
and Amerini, 2019) for mobile phone classification.
They applied K-means clustering to features extracted from intermediate layers of the CNN to detect splicing and used the cosine distance between consecutive frame vectors for localization.
3 MOBILE PHONE
IDENTIFICATION
Due to tolerances in the manufacturing of electronic components, no two implementations of an electronic circuit, particularly one that includes a microphone, have exactly the same transfer function (Hanilci et al.,
2011). The recorded speech signal’s frequency spec-
trum can be considered as the multiplication of the
spectrum of the original speech signal with the trans-
fer function of the mobile phone’s microphone. Con-
sequently, every mobile phone leaves its own device
fingerprint on the recorded speech signal.
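In the frequency domain, this multiplicative model can be restated per analysis frame as Y(f) = S(f) · H(f), where S(f) is the spectrum of the clean speech, H(f) is the transfer function of the phone's recording chain, and Y(f) is the recorded spectrum. Taking the logarithm of the magnitude spectrum, log|Y(f)| = log|S(f)| + log|H(f)|, turns the device trace into an additive term, which is the kind of structure that cepstral features such as the MFCCs can capture.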
Mobile phone identification is very similar to
closed-set text-independent speaker recognition. A
simple block diagram of an automatic speaker recog-
nition system is shown in Fig. 1. The same princi-
ples used in speaker recognition also apply to mobile
phone identification. Closed-set mobile phone identi-
fication is done in 3 stages: feature extraction, model-
ing, and classification.
Figure 1: Block diagram of a basic closed-set automatic
speaker recognition system.
The features used are the MFCCs, which are derived from the speech signal spectrum. The shape of the vocal tract carries information about the speaker and about the phoneme being uttered, and it manifests itself in the spectral envelope of the short-time power spectrum. The MFCCs aim to represent that envelope.
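As an illustration, a minimal sketch of the per-frame MFCC extraction, assuming librosa is available; the 23-coefficient, 20 ms / 10 ms setup matches the one reported in Section 5.1, and the file name is a placeholder.

import librosa

# Load a recording at 16 kHz (the MOBIPHONE sampling rate); "recording.wav" is a placeholder.
y, sr = librosa.load("recording.wav", sr=16000)

n_fft = int(0.020 * sr)    # 20 ms Hamming window
hop = int(0.010 * sr)      # 10 ms hop, i.e., 10 ms overlap between consecutive windows
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=23, n_fft=n_fft,
                            hop_length=hop, win_length=n_fft, window="hamming")
features = mfcc.T          # one 23-dimensional MFCC vector per frame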
Once the MFCCs have been extracted, each mo-
bile phone model or brand is modeled using a GMM
trained on the corresponding short-time features us-
ing the EM algorithm. The GMM probability den-
sity function is used to measure the similarity between
the testing data of an unknown mobile phone and the
trained GMMs. To obtain a fixed-dimensional vector
representing a speech utterance for SVM and neural
classification, we construct GSVs by concatenating
the parameters of a GMM trained on the speech ut-
terance.
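A minimal sketch of GSV construction from an utterance-level GMM, assuming scikit-learn (the k-means++ initialization mentioned in Section 5.1 requires a recent scikit-learn version); `frames` stands for the MFCC matrix of one utterance, with one row per frame.

import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_supervector(frames, n_components=2, with_cov=False):
    # Fit an utterance-dependent GMM with diagonal covariances on the MFCC frames.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          init_params="k-means++", random_state=0).fit(frames)
    parts = [gmm.means_.ravel()]                # concatenated component means
    if with_cov:
        parts.append(gmm.covariances_.ravel())  # optionally append the diagonal covariances
    return np.concatenate(parts)                # fixed-dimensional GSV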
After training the GMMs and extracting the
GSVs, classification is performed. Firstly, an ML
classifier is utilized by using the MFCC vectors ex-
tracted from the testing samples and comparing their
similarity with the trained GMMs. SVMs are also
trained using the GSVs extracted from the GMMs.
Finally, we employ the neural architecture in (Zeng
et al., 2023) for mobile phone identification that con-
siders both the MFCCs and the GSVs. The mobile
phone identification process outlined above will be re-
ferred to as the traditional approach. We examine two
different modifications to the traditional approach.
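Before turning to the two modifications, a minimal sketch of the ML classification step of the traditional approach, assuming scikit-learn; `train_frames_per_class` is a hypothetical dictionary mapping each phone label to its training MFCC frames.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(train_frames_per_class, n_components=64):
    # One GMM per mobile phone brand or model, fitted on the corresponding MFCC frames.
    return {label: GaussianMixture(n_components=n_components, covariance_type="diag",
                                   init_params="k-means++", random_state=0).fit(X)
            for label, X in train_frames_per_class.items()}

def ml_classify(test_frames, class_gmms):
    # Assign the test utterance to the class whose GMM gives the highest average log-likelihood.
    scores = {label: gmm.score(test_frames) for label, gmm in class_gmms.items()}
    return max(scores, key=scores.get)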
The first modification is referred to as the
non-speech approach. Speech parts have speaker-
dependent information that is irrelevant to the task at
hand. The noise-like signals in the non-speech parts
have a flatter spectral density and better capture the
mobile phone’s recording circuitry transfer function.
Accordingly, the feature extraction stage is modified
to extract MFCCs from non-speech parts. To obtain
non-speech parts, we use either a voice activity detec-
tion algorithm (Sohn et al., 1999) or speech enhance-
ment to estimate a clean speech signal and subtract it
from the initial speech signal to obtain a residual noise
signal. The speech enhancement method utilizes the
noise estimation method described in (Martin, 2001).
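A simplified stand-in for the first option: the paper relies on the statistical VAD of (Sohn et al., 1999), whereas the sketch below, given only as an illustration, keeps low-energy frames selected with a crude frame-energy threshold (the percentile choice is arbitrary).

import numpy as np

def non_speech_frames(y, sr, win_ms=20, hop_ms=10, percentile=30):
    win = int(win_ms * sr / 1000)
    hop = int(hop_ms * sr / 1000)
    # Slice the signal into overlapping frames and compute the short-time energy of each one.
    frames = np.array([y[i:i + win] for i in range(0, len(y) - win + 1, hop)])
    energy = (frames ** 2).mean(axis=1)
    threshold = np.percentile(energy, percentile)  # arbitrary low-energy threshold
    return frames[energy <= threshold]             # frames treated as non-speech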
The second modification, referred to as the UBM
approach, stems from the fact that the problem of mo-
bile phone identification from recorded speech sig-
nals is very similar to the speaker recognition prob-
lem (Hanilci et al., 2011). A speech signal is mod-
eled in the time domain as the convolution of the vo-
cal cords’ excitation and the vocal tract’s impulse re-
sponse. Speaker recognition aims to extract features
that can distinguish the transfer function of the vo-
cal tract of each speaker. A recorded speech signal,
however, also contains a convolutional distortion in-
troduced by the mobile phone’s microphone. Sup-
pose that the recorded speech signal is produced by
a transfer function given by the serial composition of
the vocal tract transfer function and the microphone’s
transfer function. In that case, the task at hand be-
comes tantamount to speaker recognition. Thus, the
use of a UBM model, a large speaker-independent
GMM trained via the EM algorithm to capture the dis-
tribution of the MFCCs in a very large set of speak-
ers, becomes justified. The UBM is used to adapt the
GMMs directly via MAP adaptation.
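A minimal sketch of mean-only MAP adaptation of a UBM to the frames of one utterance, following (Reynolds et al., 2000) and assuming the UBM is a fitted scikit-learn GaussianMixture; r denotes the relevance factor, whose value here is a common default rather than one taken from the paper.

import numpy as np

def map_adapt_means(ubm, frames, r=16.0):
    gamma = ubm.predict_proba(frames)          # responsibilities, shape (num_frames, K)
    n_k = gamma.sum(axis=0)                    # soft frame counts per component
    # First-order statistics: responsibility-weighted mean of the frames per component.
    E_k = (gamma.T @ frames) / np.maximum(n_k[:, None], 1e-10)
    alpha = n_k / (n_k + r)                    # data-dependent adaptation coefficients
    # Interpolate between the new statistics and the UBM means; the adapted means
    # are then concatenated into the UBM-adapted GSV.
    return alpha[:, None] * E_k + (1.0 - alpha)[:, None] * ubm.means_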
4 DATASETS
4.1 MOBIPHONE and Augmentations
The MOBIPHONE dataset (Kotropoulos and Sama-
ras, 2014) consists of recorded speech utterances of
24 speakers, 12 males and 12 females, using 21 mo-
bile phones belonging to 7 brands. The 24 speakers
were randomly chosen from the TIMIT dataset (Garo-
folo et al., 1988). Each speaker uttered 10 sentences
approximately 3 seconds long. Each speaker’s sen-
tences were concatenated in a single WAV file. Thus,
the dataset contains 504 WAV files sampled at 16 kHz,
one for each speaker recorded on each mobile phone.
To compare the performance of the different fea-
ture extraction and modeling approaches, 8 augmen-
tations were applied to the entire dataset to create 8
additional versions of the MOBIPHONE dataset.
The first set of 3 augmentations aims to simulate
different recording conditions. It includes the addi-
tion of Gaussian noise at a random signal-to-noise ra-
tio (SNR) between 10-20 dB, the addition of back-
ground chattering noise at a random SNR between
10-20 dB, and the addition of reverberation.
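A minimal sketch of the first of these augmentations, additive Gaussian noise at a random SNR drawn from [10, 20] dB; augmentation libraries such as audiomentations provide comparable transforms, and no particular toolchain is implied here.

import numpy as np

def add_gaussian_noise(y, snr_db_low=10.0, snr_db_high=20.0, rng=None):
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(snr_db_low, snr_db_high)        # random SNR in [10, 20] dB
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise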
The second set of another 5 augmentations aims to introduce more speaker variability: the removal of randomly selected segments from the audio signal (which also enlarges the dataset), the adjustment of the loudness of some segments (i.e., multiplication by a constant in the range [0.5, 2]), the modification of the pitch with a factor in the range [−10, 10], the modification of the speed (i.e., time scaling with a factor in the range [0.5, 2]), and Vocal Tract Length Perturbation (VTLP) with a factor in the range [0.9, 1.1].
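For illustration, a sketch of two of these augmentations (pitch and speed), assuming librosa; interpreting the pitch factor as semitone steps is an assumption, and the actual augmentation toolchain is not specified in the text.

import librosa
import numpy as np

rng = np.random.default_rng(0)
y, sr = librosa.load("recording.wav", sr=16000)                    # placeholder file name
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-10, 10))
y_speed = librosa.effects.time_stretch(y, rate=rng.uniform(0.5, 2.0))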
4.2 CCNU Mobile
The Central China Normal University (CCNU) Mo-
bile dataset (Zeng et al., 2020) consists of speech
recordings from the TIMIT dataset recorded by 45
mobile phones belonging to 9 brands. Multiple de-
vices of the same mobile phone model are included
in the dataset. Each mobile phone contributes 642 audio samples, each about 8 seconds long and sampled at 32 kHz. Unfortu-
nately, the entire dataset is not publicly available. We
were able to use a small subset of the dataset, which
consists of five 8-second-long audio samples of female
speakers per mobile phone. This data scarcity will al-
low us to test our methods under realistic audio foren-
sic conditions with limited data.
5 EXPERIMENTAL RESULTS
5.1 Results on MOBIPHONE
Experiments were conducted on the MOBIPHONE
dataset using the traditional, non-speech, and UBM
approaches. Two tasks were considered, namely
brand identification and model identification. We
used a 50%/50% train/test split with half male and
half female speakers in both sets.
23 MFCCs were extracted using 20 ms Hamming
windows with 10 ms overlap. The MFCCs for the
non-speech approach were extracted from non-speech
frames obtained via voice activity detection. The
MFCCs for the traditional and UBM approaches are
referred to as audio MFCCs, while the MFCCs for
the non-speech approach are referred to as non-speech
MFCCs.
Model- and brand-dependent GMMs were trained
via the EM algorithm, using a k-means++ initializa-
tion for both the audio and non-speech MFCC vec-
tors. We trained GMMs with 1, 2, 4, 8, 16, 32, and 64 components.
These GMMs were used in the ML classification. For
the SVM classification, utterance-dependent GMMs
were trained using audio and non-speech MFCC vec-
tors. The components of these GMMs were in the
range 1, 2, and 4 because the classification accuracy
was drastically reduced when more than 4 compo-
nents were employed in the extracted GSVs due to
the large number of features.
Finally, utterance-dependent GMMs were also
trained using UBM MAP adaptation applied to the au-
dio MFCCs. The UBMs were trained on the TIMIT
dataset. For the UBM-adapted GSVs, SVMs built on UBMs with up to 16 components yielded good results. As is most common, only the means were adapted.
The results of the initial experiments on the MO-
BIPHONE dataset are listed in Tables 1-4. The
first column shows the number of components of the
GMMs. The remaining columns display the accu-
racies of the traditional approach (Audio), the non-
speech approach (Non-speech), and the UBM ap-
proach (UBM) on the corresponding classification
method.
Table 1: ML classification accuracy for brand identification
on the MOBIPHONE dataset using audio and non-speech
MFCCs.
# Components Audio Non-speech
1 66.6667% 79.3651%
2 73.4127% 86.1111%
4 88.4921% 96.8254%
8 91.6667% 99.2063%
16 88.0952% 99.6032%
32 93.254% 100%
64 98.0159% 99.6032%
Table 2: ML classification accuracy for model identification
on the MOBIPHONE dataset using audio and non-speech
MFCCs.
# Components Audio Non-speech
1 96.0317% 98.8095%
2 98.8095% 99.6032%
4 98.4127% 100%
8 99.2063% 100%
16 98.4127% 100%
32 98.8095% 100%
64 98.4127% 100%
A first remark on the results in Tables 1-4 is that brand identification appears harder than model identification. In practice, the brand can be deduced directly from the model, and since model identification performs better, brand identification has little practical use on its own. It is still worth examining, however, because it makes the accuracy gains offered by non-speech MFCCs and UBM adaptation more visible. Both the non-speech and the UBM approaches outperform the traditional approach under ML and SVM classification. For the non-speech approach, we also note that GSVs that include the covariances do not always improve accuracy over mean-only GSVs, an observation that is also confirmed in the following experiments.
The same classification procedure was followed
on the augmented versions of the MOBIPHONE
dataset. The aim is to demonstrate how the dif-
ferent approaches perform under different types of
noise. The 504 samples of the baseline MOBI-
PHONE dataset were expanded by the 504 samples
of one or more augmented versions.
The best accuracy on the first set of augmenta-
tions, which simulate different recording conditions,
Table 3: SVM classification accuracy for brand identification on the MOBIPHONE dataset using GSVs extracted from audio
MFCCs, non-speech MFCCs, and UBM adaptation.
Audio Non-speech UBM
# Components Means Means+Cov Means Means+Cov Means
1 93.6508% 96.4286% 97.619% 97.619% 98.0159%
2 93.6508% 96.0317% 97.619% 97.619% 97.2222%
4 82.5397% 85.3175% 92.8571% 92.8571% 98.8095%
8 - - - - 98.8095%
16 - - - - 96.0317%
Table 4: SVM classification accuracy for model identification on the MOBIPHONE dataset using GSVs extracted from audio
MFCCs, non-speech MFCCs, and UBM adaptation.
Audio Non-speech UBM
# Components Means Means+Cov Means Means+Cov Means
1 94.0476% 96.4286% 99.2063% 98.8095% 94.0476%
2 94.0476% 96.4286% 97.619% 97.2222% 98.0159%
4 92.8571% 94.4444% 96.4286% 95.2381% 98.8095%
8 - - - - 98.8095%
16 - - - - 94.4444%
Table 5: The best classification accuracy on MOBIPHONE for brand identification on the first set of augmentations.
                 Audio ML        Audio SVM        Non-speech ML   Non-speech SVM   UBM SVM
Baseline         98.0159% (64)   96.4286% (1/MC)  100% (32)       97.619% (1/M)    98.8095% (4/M)
Gaussian         85.7143% (64)   89.2857% (1/M)   81.3492% (64)   92.2619% (1/MC)  89.2857% (1/M)
Background       92.0635% (64)   92.8571% (2/M)   86.5079% (64)   91.6667% (1/M)   98.4127% (4/M)
Reverberation    96.8254% (32)   95.0397% (1/MC)  99.0079% (64)   97.2222% (1/MC)  98.4127% (8/M)
All              88.8889% (64)   93.0556% (1/MC)  83.3333% (64)   90.4762% (1/M)   95.2381% (2/M)
Table 6: The best classification accuracy on MOBIPHONE for model identification on the first set of augmentations.
                 Audio ML        Audio SVM        Non-speech ML   Non-speech SVM   UBM SVM
Baseline         99.2063% (8)    96.4286% (1/MC)  100% (4)        99.2063% (1/M)   98.8095% (4/M)
Gaussian         89.881% (16)    90.6746% (1/MC)  78.9683% (64)   93.8492% (1/M)   92.8571% (2/M)
Background       93.0556% (16)   95.4365% (1/MC)  83.3333% (64)   94.0476% (1/MC)  97.4206% (2/M)
Reverberation    98.6111% (8)    96.2302% (2/MC)  99.6032% (32)   98.4127% (1/M)   98.2143% (4/M)
All              91.5675% (64)   92.2619% (1/M)   80.2579% (64)   91.8651% (1/MC)  96.2302% (2/M)
is summarized in Tables 5-6 for brand and model
identification, respectively. The best hyperparameters are indicated in parentheses, namely the number of Gaussian components and the type of GSVs (M: means only, MC: means and covariances). The best accuracy listed in Tables 1-4 is
quoted in the row Baseline of Tables 5-6.
For brand and model identification, non-speech
ML classification performs worse than the traditional
ML classification in all cases except for the reverber-
ation augmentation. However, non-speech SVM clas-
sification performs better than the traditional SVM
classification in the Gaussian noise and reverberation
augmentations. In all cases, the UBM-adapted SVM
classification achieves higher accuracy than the tradi-
tional SVM classification. It also yields the top ac-
curacy when all the augmentations are used together.
Overall, the UBM approach performs better on the
first set of augmentations. This is attributed to the
probabilistic alignment of the MFCC vectors to the
UBM, which can mitigate the effects of the noise on
the features.
The best classification accuracy on the second set
of augmentations is quoted in Tables 7-8 for brand
and model identification, respectively. The non-
speech ML classification of brand and model achieves
the highest accuracy for all augmentations but the
VTLP augmentation in the model identification task.
The effect of speaker variability is greatly reduced because the non-speech features are largely unaffected by the speaker-dependent variation amplified by this set of augmentations. When SVM classification is employed, the
UBM and non-speech approaches achieve better
Table 7: The best classification accuracy on MOBIPHONE for brand identification on the second set of augmentations.
                 Audio ML        Audio SVM        Non-speech ML   Non-speech SVM   UBM SVM
Baseline         98.0159% (64)   96.4286% (1/MC)  100% (32)       97.619% (1/M)    98.8095% (4/M)
Crop             98.2143% (64)   96.8254% (1/MC)  99.8016% (64)   98.6111% (1/M)   99.0079% (16/M)
Loudness         97.4206% (64)   96.0317% (1/MC)  99.0079% (8)    99.2063% (1/M)   97.8175% (2/M)
Pitch            92.0635% (64)   86.3095% (1/MC)  97.4206% (64)   94.8413% (2/MC)  94.4444% (16/M)
Speed            98.2143% (64)   97.0238% (2/MC)  99.6032% (32)   99.2063% (1/MC)  98.4127% (16/M)
VTLP             96.4286% (64)   98.0159% (1/MC)  98.4127% (32)   97.619% (1/MC)   97.619% (8/M)
All              95.6349% (64)   95.172% (1/MC)   98.0159% (64)   96.627% (1/MC)   96.9577% (8/M)
Table 8: The best classification accuracy on MOBIPHONE for model identification on the second set of augmentations.
                 Audio ML        Audio SVM        Non-speech ML   Non-speech SVM   UBM SVM
Baseline         99.2063% (8)    96.4286% (1/MC)  100% (4)        99.2063% (1/M)   98.8095% (4/M)
Crop             99.4048% (32)   96.2302% (2/MC)  100% (4)        99.0079% (1/MC)  99.6032% (4/M)
Loudness         98.8095% (8)    97.4206% (1/MC)  99.2063% (8)    99.0079% (1/M)   99.0079% (2/M)
Pitch            94.6429% (64)   86.9048% (1/MC)  98.4127% (32)   97.0238% (1/MC)  97.4206% (4/M)
Speed            98.8095% (8)    96.627% (1/MC)   100% (4)        99.4048% (1/MC)  98.0159% (8/M)
VTLP             99.4048% (64)   98.0159% (1/MC)  98.8095% (4)    99.2063% (1/MC)  98.8095% (4/M)
All              97.2884% (64)   93.1217% (1/M)   98.8095% (32)   96.6931% (1/M)   98.8095% (8/M)
Table 9: The best classification accuracy on MOBIPHONE for brand identification on both sets of augmentations combined.
                   Audio ML        Audio SVM        Non-speech ML   Non-speech SVM   UBM SVM
Baseline           98.0159% (64)   96.4286% (1/MC)  100% (32)       97.619% (1/M)    98.8095% (4/M)
All augmentations  91.4021% (64)   91.9312% (1/MC)  89.903% (64)    93.9594% (1/M)   93.6067% (2/M)
Table 10: The best classification accuracy on MOBIPHONE for model identification on both sets of augmentations combined.
                   Audio ML        Audio SVM        Non-speech ML   Non-speech SVM   UBM SVM
Baseline           99.2063% (8)    96.4286% (1/MC)  100% (4)        99.2063% (1/M)   98.8095% (4/M)
All augmentations  93.739% (64)    92.0635% (2/MC)  89.9471% (64)   94.0035% (1/M)   93.2099% (2/M)
Table 11: Mean and standard deviation of the classification
accuracy for brand identification on the neural network us-
ing Zeng’s UBM.
Zeng’s UBM
MFCC branch 41.7777% (5.1225%)
GSV branch 91.4074% (0.9999%)
Fusion branch 70.7407% (7.7%)
Table 12: Mean and standard deviation of the classification
accuracy for model identification on the neural network us-
ing Zeng’s UBM.
Zeng’s UBM
MFCC branch 23.9555% (6.6566%)
GSV branch 96.2222% (1.54%)
Fusion branch 73.8888% (10.2003%)
accuracy than the traditional approach.
When both sets of augmentations are used to-
gether, the brand and model identification accuracy
is summarized in Tables 9-10. In both tasks, non-
speech SVMs yield the best accuracy. Non-speech
ML classification performs worse than the traditional
ML classification. UBM-adapted SVM classification
also performs better than the traditional SVM classi-
fication.
5.2 Results on CCNU Mobile
The CCNU Mobile dataset was used to compare the
identification methods further. In addition to the ML
and SVM classification methods, the neural architec-
ture designed specifically for mobile phone identifi-
cation in (Zeng et al., 2023) was included.
The entirety of the CCNU Mobile dataset is not
available, and we could use only a small sample of
it containing 5 audio samples from female speakers
for each of the 45 mobile phones. However, the limited data helps us assess the quality of the extracted features: since the feature vectors are so few, the classifiers have to rely on their quality rather than their quantity.
The first two speakers were chosen for the train
set, while the remaining 3 speakers were selected for
the test set. To be consistent with the features used
Table 13: Mean and standard deviation of the classification accuracy for brand identification using the three approaches with
ML, SVMs, and the neural network proposed by (Zeng et al., 2023).
Audio Non-speech UBM
ML-64 86.5899% (1.7982%) 99.2593% (0%) -
SVM-1M 86.0741% (4.2881%) 86.8148% (5.8514%) -
SVM-1MC 68.6667% (10.0482%) 83.4074% (4.7263%) -
SVM-4M - - 86.5926% (1.3726%)
MFCC branch 45.3333% (6.5309%) 60.3703% (6.6369%) 44.5185% (3.7357%)
GSV branch 61.0370% (3.7803%) 83.3333% (2.5241%) 94.7407% (1.3725%)
Fusion branch 60.5926% (4.9481%) 81.6296% (3.9009%) 70.0740% (7.3761%)
Table 14: Mean and standard deviation of the classification accuracy for model identification using the three approaches with
ML, SVMs, and the neural network proposed by (Zeng et al., 2023).
Audio Non-speech UBM
ML-64 98.5185% (0%) 98.5185% (0%) -
SVM-1M 90.8148% (5.7292%) 96.5185% (3.5447%) -
SVM-1MC 81.9259% (6.3356%) 90.7407% (1.0030%) -
SVM-4M - - 93.3333% (0.7808%)
MFCC branch 22.3703% (5.2471%) 43.7037% (7.7454%) 21.7037% (5.3307%)
GSV branch 56.7407% (4.3361%) 89.8518% (2.9845%) 98.2222% (1.1152%)
Fusion branch 52.1481% (2.0717%) 86.4444% (4.3199%) 63.5555% (13.1018%)
in (Zeng et al., 2023), we also used 39-dimensional
MFCC vectors. The features were extracted using 16
ms Hamming windows with an 8 ms overlap.
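A sketch of the 39-dimensional feature extraction, assuming librosa and assuming (as is conventional, though not stated explicitly above) that the 39 dimensions consist of 13 static MFCCs plus their delta and delta-delta coefficients; the file name is a placeholder.

import librosa
import numpy as np

y, sr = librosa.load("recording.wav", sr=32000)        # CCNU Mobile recordings are sampled at 32 kHz
n_fft = int(0.016 * sr)                                # 16 ms Hamming window
hop = int(0.008 * sr)                                  # 8 ms hop
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                            hop_length=hop, win_length=n_fft, window="hamming")
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),               # delta coefficients
                   librosa.feature.delta(mfcc, order=2)]).T   # delta-delta coefficients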
Because the audio samples were only 8 seconds
long, the voice activity detection used in the non-
speech approach did not produce enough frames for
the GMM training. Instead, for the non-speech approach,
residual noise signals were obtained by subtracting
an estimated clean speech signal from the original
speech signal.
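A sketch of this residual-noise alternative; the paper's enhancement step is based on the minimum-statistics noise estimator of (Martin, 2001), whereas the noisereduce package below is used only as an illustrative stand-in for the enhancement.

import librosa
import noisereduce as nr

y, sr = librosa.load("recording.wav", sr=32000)   # placeholder file name
y_enhanced = nr.reduce_noise(y=y, sr=sr)          # estimated clean speech signal
residual = y - y_enhanced                         # residual noise signal used for feature extraction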
To enable a fair comparison of the accuracy of our approaches with that disclosed in (Zeng
et al., 2023), we emulated the results that would have
been obtained on the small sample of the CCNU Mo-
bile dataset by using the UBM trained on the TIMIT
dataset of the original paper. In every experiment, the
neural network training is done for 100 epochs using
a batch size of 16. The learning rate is 0.001 and de-
creases by 1/10 every 20 epochs. The Adaptive Mo-
ment Estimator (Adam) optimizer is used. Brand and
model identification accuracy using Zeng’s UBM are
shown in Tables 11-12. The GMM hyperparameters
were fixed for this experiment, so every classification
was repeated 10 times and the mean and standard de-
viation were reported.
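A sketch of the reported training configuration, assuming PyTorch; the linear layer is only a placeholder standing in for the architecture of (Zeng et al., 2023), and the data-loading code is omitted.

import torch

model = torch.nn.Linear(39, 45)  # placeholder model, not the actual spatio-temporal network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)         # Adam, initial learning rate 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(100):                                          # 100 epochs
    # ... one pass over the training set in batches of 16 would go here ...
    scheduler.step()                                              # learning rate decreased by 1/10 every 20 epochs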
The MFCC branch of the network is the module
that works with the MFCCs only, the GSV branch is
the module that works with the GSVs only, and the
fusion branch is the module that fuses the results of
the previous two branches to obtain a final classifica-
tion. Implementation details can be found in (Zeng
et al., 2023). It is seen that without enough data, the
GSV branch performs better than the MFCC branch
and the fusion branch in both tasks.
The results of the traditional, the non-speech, and
the UBM approach, when ML, SVM, and neural clas-
sification are applied, are summarized in Tables 13-
14. The UBM here is also trained on the TIMIT
dataset. For the ML classification, 64 component
GMMs are used. For the GSVs fed to the SVMs, 1
component is used, while for the SVMs applied to the
UBM-adapted GSVs, 4 components are used.
In both tasks, the non-speech MFCCs improve the
accuracy of all classification approaches compared to
the traditional approach. This is also true when the
UBM is used. Examining the neural network clas-
sification a little closer (i.e., the last 3 table rows in
Tables 13 and 14), we see that the GSV branch us-
ing the UBM-adapted GSVs gives the best mean ac-
curacy for both brand and model identification when
the neural classification is employed. Zeng’s UBM
(Tables 11-12) yields similar results to our UBM, as
expected, except the fusion branch for model iden-
tification, where Zeng’s UBM attains a higher mean
accuracy but a similarly large standard deviation.
6 CONCLUSIONS
This paper has disclosed empirical evidence demon-
strating that either the non-speech or the UBM approach increases mobile phone identification accuracy, even under data scarcity and noisy
recordings. The non-speech approach performs better
when a lot of speaker variability is observed since the
features extracted from the non-speech segments of
the signal are not corrupted by speaker-related noise.
The UBM approach performs better in the presence of
recording noise because of the probabilistic “match-
ing” of the extracted features with the UBM’s com-
ponents, which is robust to noisy features.
Considering future work, more experiments
should be conducted on other datasets, especially the
entire CCNU Mobile dataset if it becomes publicly
available, to check whether our results on the small sample carry over to the whole dataset. In the non-
speech approach, instead of the MFCCs, which are
designed based on the logarithmic scale of the human
auditory system, other features providing the same
resolution in all frequency ranges might also yield
better results since the features are extracted from
noise-like signals and not speech.
ACKNOWLEDGEMENTS
This research was supported by the Hellenic Foun-
dation for Research and Innovation (H.F.R.I.) under
the “2nd Call for H.F.R.I Research Projects to support
Faculty Members & Researchers” (Project Number:
3888).
REFERENCES
Alsulaiman, M., Muhammad, G., Bencherif, M. A., Mah-
mood, A., and Ali, Z. (2013). KSU rich Arabic speech database. Information (Japan), 16(6B):4231–4253.
Baldini, G. and Amerini, I. (2019). Smartphones identi-
fication through the built-in microphones with con-
volutional neural network. IEEE Access, 7:158685–
158696.
Baldini, G. and Amerini, I. (2022). Microphone identifica-
tion based on spectral entropy with convolutional neu-
ral network. In Proc. 2022 IEEE International Work-
shop on Information Forensics and Security (WIFS),
pages 1–6. IEEE.
Berdich, A., Groza, B., Levy, E., Shabtai, A., Elovici,
Y., and Mayrhofer, R. (2022). Fingerprinting smart-
phones based on microphone characteristics from
environment affected recordings. IEEE Access,
10:122399–122413.
Berdich, A., Groza, B., and Mayrhofer, R. (2023). A survey
on fingerprinting technologies for smartphones based
on embedded transducers. IEEE Internet of Things
Journal, 10(16):14646–14670.
Davis, S. and Mermelstein, P. (1980). Comparison of para-
metric representations for monosyllabic word recog-
nition in continuously spoken sentences. IEEE Trans-
actions on Acoustics, Speech, and Signal Processing,
28(4):357–366.
Dempster, A., Laird, N., and Rubin, D. (1977). Maxi-
mum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
Garcia-Romero, D. and Espy-Wilson, C. Y. (2010). Au-
tomatic acquisition device identification from speech
recordings. In Proc. 2010 IEEE International Con-
ference on Acoustics, Speech, and Signal Processing,
pages 1806–1809. IEEE.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G.,
and Pallett, D. S. (1988). Getting started with the
DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database. National Institute of Standards and Technology (NIST), Gaithersburg, MD, 107:16.
Giganti, A., Cuccovillo, L., Bestagini, P., Aichroth, P., and
Tubaro, S. (2022). Speaker-independent microphone
identification in noisy conditions. In Proc. 2022 30th
European Signal Processing Conference (EUSIPCO),
pages 1047–1051. IEEE.
Hanilçi, C., Ertaş, F., Ertaş, T., and Eskidere, Ö. (2011). Recognition of brand and models of cell-phones from recorded speech signals. IEEE Transactions on Information Forensics and Security, 7(2):625–634.
Hanilçi, C. and Kinnunen, T. (2014). Source cell-phone
recognition from recorded speech using non-speech
segments. Digital Signal Processing, 35:75–85.
Khan, M. K., Zakariah, M., Malik, H., and Choo, K.-K. R.
(2018). A novel audio forensic data-set for digital
multimedia forensics. Australian Journal of Forensic
Sciences, 50(5):525–542.
Kotropoulos, C. (2014). Source phone identification using
sketches of features. IET Biometrics, 3(2):75–83.
Kotropoulos, C. and Samaras, S. (2014). Mobile phone
identification using recorded speech signals. In Proc.
2014 19th International Conference on Digital Signal
Processing, pages 586–591. IEEE.
Leonzio, D. U., Cuccovillo, L., Bestagini, P., Marcon, M.,
Aichroth, P., and Tubaro, S. (2023). Audio splicing
detection and localization based on acquisition device
traces. IEEE Transactions on Information Forensics
and Security, 18:4157–4172.
Maher, R. (2009). Audio forensic examination. IEEE Signal
Processing Magazine, 26(2):84–94.
Martin, R. (2001). Noise power spectral density estimation
based on optimal smoothing and minimum statistics.
IEEE Transactions on Speech and Audio Processing,
9(5):504–512.
Qamhan, M., Alotaibi, Y. A., and Selouani, S. A. (2023).
Transformer for authenticating the source microphone
in digital audio forensics. Forensic Science Interna-
tional: Digital Investigation, 45:301539.
Reynolds, D., Quatieri, T., and Dunn, R. (2000). Speaker
verification using adapted Gaussian mixture models.
Digital Signal Processing, 10(1-3):19–41.
Reynolds, D. A. (1997). HTIMIT and LLHDB: speech cor-
pora for the study of handset transducer effects. In
Proc. 1997 IEEE International Conference on Acous-
tics, Speech, and Signal Processing, volume 2, pages
1535–1538. IEEE.
Sohn, J., Kim, N. S., and Sung, W. (1999). A statistical
model-based voice activity detection. IEEE Signal
Processing Letters, 6(1):1–3.
Zeng, C., Feng, S., Zhu, D., and Wang, Z. (2023). Source
acquisition device identification from recorded audio
based on spatiotemporal representation learning with
multi-attention mechanisms. Entropy, 25(4):626.
Zeng, C., Zhu, D., Wang, Z., Wang, Z., Zhao, N., and
He, L. (2020). An end-to-end deep source record-
ing device identification system for web media foren-
sics. International Journal of Web Information Sys-
tems, 16(4):413–425.