voice cloning. Autotune is commonly used in real time during live performances, introducing a small delay that is manageable with the right parameters and hardware.
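For illustration, the core quantization step behind autotune-style pitch correction can be sketched as follows. This is a minimal sketch of semitone snapping in Python, not the algorithm of any specific commercial plug-in; the function and parameter names are our own, and real implementations additionally smooth the correction (the "retune speed") to avoid audible artifacts.

import numpy as np

def snap_to_semitone(f0_hz, reference_hz=440.0):
    """Snap detected pitches (Hz) to the nearest equal-tempered semitone.

    Sketch of the quantization step underlying automatic pitch correction;
    unvoiced frames (non-positive values) are returned as NaN.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    out = np.full_like(f0_hz, np.nan)
    voiced = f0_hz > 0
    semitones = 12.0 * np.log2(f0_hz[voiced] / reference_hz)
    out[voiced] = reference_hz * 2.0 ** (np.round(semitones) / 12.0)
    return out

# Example: a slightly flat A4 (437 Hz) is pulled to 440 Hz,
# and 523 Hz is pulled to C5 (523.25 Hz).
print(snap_to_semitone([437.0, 523.0, 0.0]))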
3.3 Comparing the Unaltered Recording with AI Voice Cloning
In this section, we present our observations from comparing the unaltered recording against the AI voice cloned version. Figure 8 illustrates these signals, with a melodic range spectrogram of the original singer included as an additional reference.
Figure 8: (A, top) Unaltered recording, (B, middle) AI voice
cloned version, (C, bottom) Original singer.
The AI voice cloning process requires ample training data to deliver the desired results. Training data in this case refers to voice recordings of the individual whose voice will serve as the inference voice. The inference voice is then used to mimic the performance captured in the signal to be processed. In our case, the inference voice is that of the singer of the unaltered recording, while the signal to be processed is the voice recording of the original singer.
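Conceptually, the roles described above can be summarized with the following sketch; convert_voice and its parameters are hypothetical placeholders, not the RVC WebUI API.

# Hypothetical sketch of the data flow described above; convert_voice is a
# placeholder, not the RVC WebUI API. A real implementation would load the
# trained model and resynthesize the source audio in the cloned timbre.
def convert_voice(signal_to_process: str, inference_voice_model: str) -> str:
    """Return the path of audio that follows the performance in
    signal_to_process but is rendered with the trained inference voice."""
    output_path = "ai_voice_cloned.wav"
    print(f"Converting {signal_to_process} with {inference_voice_model} -> {output_path}")
    return output_path

# In this study: the inference voice is trained on the unaltered recording's
# singer, and the signal to be processed is the original singer's recording.
convert_voice("original_singer.wav", "unaltered_singer_model.pth")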
In the Retrieval-Based Voice Conversion (RVC) WebUI, a few sample inference voices are provided. These inference voices were trained with nearly 50 hours of training data. However, around 10 minutes of high-quality training data is sufficient for an inference voice of acceptable quality. We used roughly 10 minutes of training data to create an inference voice. Training an inference voice is computationally expensive compared to autotune: it took us around 20 minutes to train an inference voice from 10 minutes of training data on a Windows PC with an 11th Gen Intel Core i5 and Intel Iris Xe graphics. Training can be expected to run faster on more capable hardware. Real-time implementations of AI voice cloning already exist, using pre-trained inference voices.
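As an illustration of how such a small training set might be assembled, the sketch below slices a longer dry vocal take into short clips and discards near-silent segments. The file names, clip length, and silence threshold are our own assumptions, and this preprocessing is not part of the RVC WebUI itself.

# Sketch (our own, with assumed file names): slice ~10 minutes of dry vocal
# recording into short clips suitable as voice-cloning training data.
import os
import numpy as np
import soundfile as sf

SOURCE = "unaltered_singer_full.wav"   # hypothetical ~10 min vocal take
CLIP_SECONDS = 10                      # short clips are easier to curate

audio, sr = sf.read(SOURCE)
if audio.ndim > 1:                     # downmix to mono if needed
    audio = audio.mean(axis=1)

os.makedirs("dataset", exist_ok=True)
clip_len = CLIP_SECONDS * sr
for i, start in enumerate(range(0, len(audio), clip_len)):
    clip = audio[start:start + clip_len]
    # Skip near-silent clips so the training set is mostly voiced material
    # (the 1e-3 RMS threshold is an arbitrary illustrative choice).
    if np.sqrt(np.mean(clip ** 2)) < 1e-3:
        continue
    sf.write(f"dataset/clip_{i:03d}.wav", clip, sr)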
In Figure 8, three regions of interest are highlighted, each with a different shape. Within the highlighted circle, the melody contour of the AI voice cloned output (Figure 8, middle) exhibits similarities to both the unaltered recording (Figure 8, top) and the original singer (Figure 8, bottom). In this region, the contour of the AI voice cloned output shows more distinct steps, closely resembling the original singer, yet it also shows pronounced dips or valleys in the middle of the contour, closely resembling the unaltered recording. Similar behavior can be observed in the regions above the highlighted line and within the highlighted square. Overall, the shape of the melody contour of the AI voice cloning result closely resembles that of the original singer, while its intensity or loudness closely resembles that of the unaltered recording.
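The comparison above was made visually from the melodic range spectrograms; a rough programmatic counterpart is sketched below, extracting a fundamental-frequency (melody) contour and frame-wise RMS intensity for each of the three recordings. The file names are hypothetical, and this is not the tool chain used to produce Figure 8.

# Sketch (our own analysis code, assumed file names) for comparing melody
# contours and intensity across the three recordings shown in Figure 8.
import librosa
import numpy as np

FILES = {
    "unaltered": "unaltered_recording.wav",
    "cloned": "ai_voice_cloned.wav",
    "original": "original_singer.wav",
}

for label, path in FILES.items():
    y, sr = librosa.load(path, sr=None, mono=True)
    # Fundamental-frequency (melody) contour via probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y=y, sr=sr,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
    )
    # Frame-wise RMS as a simple proxy for intensity/loudness.
    rms = librosa.feature.rms(y=y)[0]
    print(f"{label}: median F0 = {np.nanmedian(f0):.1f} Hz, "
          f"mean RMS = {rms.mean():.4f}")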
4 REFLECTIONS AND
CONSIDERATIONS
Autotune is portrayed as a transformative tool in
audio engineering, originally developed for pitch
correction but widely used across various musical
genres as an aesthetic choice. However, we need to
acknowledge the polarizing reception of autotune
within the music industry, with critics raising
concerns about its potential to homogenize vocal
performances and detract from authentic expression.
The emergence of AI voice cloning is a
groundbreaking advancement, enabling the
replication of human speech and singing with
unprecedented realism. AI voice cloning raises
important ethical considerations surrounding privacy,
consent, and identity manipulation, underscoring the
need for responsible practices in music production.
This study provided a comparative analysis of autotune and AI voice cloning through observation and analysis of melodic range spectrograms. These spectrograms offer insight into the output of each technology for automatic pitch correction, highlighting differences in melodic contour and intensity between the original recordings and the processed versions. By comparing the processed versions to the original recordings and to the performances of professional singers, this study offers an evaluation of the output of each technology.