Analysis of the Metrics and Evaluation Methods for Music
Ziteng Li
Kang Chiao International School Xi’an Qujiang Campus, Xi’an, China
Keywords: Music Evaluation, Deep Learning, Rhythm, Emotion Recognition, Hidden Markov Model (HMM).
Abstract: In recent decades, computer music has grown into a dominant force, revolutionizing both
music creation and evaluation methods. This study explores the evolution of music evaluation from traditional,
subjective approaches to more systematic, quantitative methods enabled by computational advancements. The
research focuses on key evaluation metrics, including emotion, rhythm, and similarity, and how models like
N-gram and Hidden Markov Models (HMM) capture melodic and rhythmic features. This research highlights
recent progress in using deep learning algorithms for music assessment and their application in tasks like
emotion recognition and music recommendation. Despite the successes, existing models often struggle with
complex emotional expressions and cross-cultural diversity in music. The findings suggest that future
improvements in music evaluation can be achieved through integrating advanced machine learning techniques
and multi-modal analysis. These results contribute to the development of more objective and comprehensive
evaluation methods, ultimately benefiting various applications in music classification, recommendation, and
automated composition.
1 INTRODUCTION
Computer music has leaped into the mainstream of
today's society after only a few decades of
development. Not only has it brought new approaches
to music creation, giving composers fresh inspiration and new expressive possibilities, but it has also dramatically changed the way music is evaluated. The history of computer music can be traced back to the mid-20th century, when advances in technology enabled the generation of sounds and music through algorithms. For example, The Silver Scale, realized at Bell Labs in 1957 with Max Mathews' MUSIC I program, is considered the first piece of music generated by a computer.
As computer performance has increased, computer
music has expanded from simple audio generation to
complex automated composition and real-time music
processing. Nowadays, along with the disruptive
development of AI, computers are not only used for
music generation but also widely used for music
classification, recommendation, and evaluation
(Cope, 1989; Cone, 1981; Salamon et al., 2012).
With the development of technology, music
assessment has also experienced a shift from the
traditional subjective, authoritative individual-led
assessment to a more scientific and systematic
assessment method (Fink, 2014). Traditional music assessment relies on the subjective evaluation of listeners or experts; although this approach carries a certain degree of authority, its results are often coloured by strong personal bias. Listeners also tend to obscure their true impressions under the influence of herd mentality and deference to authoritative judgments.
Therefore, with the popularization of computer music,
more and more researchers are committed to
developing assessment methods that can objectively
quantify musical characteristics and reduce the
interference of human factors.
In recent years, academic research on music
assessment has gradually focused on how to quantify music features and how to apply them to different music analysis tasks. For example, some scholars proposed a data-driven deep-learning pitch-contour feature algorithm alongside a hand-designed melodic feature extraction algorithm based on prior knowledge (Yang & Chen, 2018), which provides algorithmic support for automating music recognition and assessment. Other scholars have proposed emotion recognition systems based on musical features (pitch, rhythm, timbre, etc.) at the analysis level. Such research is of great significance in application scenarios such as emotion recognition for music evaluation,
music recommendation, and automated composition.
Meanwhile, with the rise of machine learning and
artificial intelligence technologies, deep learning-
based models are widely used for music assessment.
These models can automatically learn and extract
high-dimensional features in music, providing
powerful tools for music similarity analysis,
sentiment classification, and style identification. For
example, Convolutional Neural Networks (CNNs)
have achieved remarkable results in music sentiment
classification, audio signal processing, and music
style classification.
Existing assessment models still have limitations
in dealing with complex musical features, especially
in terms of emotional expression, structural
complexity, and cross-cultural diversity. The research
motivation of this paper is to explore the limitations
of existing music assessment methods and suggest
directions for improvement. The research framework
consists of the following parts: firstly, it introduces
the main points of consideration in music assessment,
such as emotion, tempo, and similarity; then it
discusses the quantitative metrics used in recent years,
which are quantitatively analysed by extracting
features such as pitch, harmony, etc.; then it
introduces the typical models used for music
assessment and their applications in emotion
recognition, recommender systems, and
categorization; and finally, it analyses the limitations
of the current methods and looks forward to possible
future improvements, such as the introduction of
machine learning and multimodal analysis.
2 DESCRIPTIONS OF MUSIC
EVALUATION
Music evaluation involves analysing various musical
elements to assess and categorize music, focusing on
aspects such as emotion, rhythm, and similarity.
These key considerations are vital in understanding
how music affects listeners and how it can be
quantitatively measured for various applications,
including recommendation systems, automated
composition, and emotional recognition.
One of the primary factors in music evaluation is
emotion. Music has the power to evoke a wide range
of emotions, from joy to sadness, and researchers
have long focused on developing methods to quantify
these emotional responses. Studies have shown that
specific musical features such as tempo, key, and
mode significantly influence emotional expression.
Major keys and fast tempos are often associated with
positive emotions, while minor keys and slower
tempos may evoke sadness or melancholy. However,
emotion recognition is not without its challenges.
Human emotions are complex and multifaceted, and
a single piece of music may evoke different emotions
in different listeners depending on their personal
experiences or cultural background. Moreover, the
same musical features may be interpreted differently
across genres. For instance, a minor key in classical
music is often associated with sadness, while in jazz
or blues, it may convey a sense of sophistication or
reflection. Emotion recognition models need to
account for such cross-cultural and genre-specific
differences to make more accurate predictions. In
particular, deep learning models such as
Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs) have been
employed to better capture the nuances of emotional
expression in music (Lin & Qi, 2018).
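To make these associations concrete, the following minimal Python sketch (an illustration with hypothetical threshold values, not a model from the cited studies) maps mode and tempo to a coarse emotion label:

```python
def coarse_emotion(mode: str, tempo_bpm: float) -> str:
    """Toy heuristic reflecting the associations described above:
    major/fast -> positive, minor/slow -> sad. Real systems learn such
    mappings from data and use far richer features (timbre, harmony,
    dynamics), since these simple cues vary by genre and culture."""
    fast = tempo_bpm >= 120          # hypothetical threshold
    if mode == "major":
        return "joyful" if fast else "calm"
    if mode == "minor":
        return "tense" if fast else "sad"
    return "neutral"

print(coarse_emotion("major", 140))  # joyful
print(coarse_emotion("minor", 60))   # sad
```

Such hand-written rules immediately expose the genre and culture dependence noted above, which is precisely why learned models are preferred in practice.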
Another crucial element in music evaluation is
rhythm. Rhythm refers to the timing and arrangement
of sounds and silences in a piece of music. It plays a
critical role in defining the structure and flow of
music, influencing how it is perceived by listeners.
Researchers have explored various metrics to
evaluate rhythm, such as beat alignment, tempo
consistency, and syncopation. In addition to its role in
music perception, rhythm is also a key indicator of
technical skill. In genres like jazz or classical music,
the ability to maintain complex polyrhythms or
perform intricate syncopations is often associated
with mastery. In contrast, genres like electronic dance
music (EDM) emphasize steady, consistent rhythms,
where tempo stability is paramount. Rhythm-based
evaluation tools help in understanding both the
aesthetic and technical aspects of rhythm across
genres.
In the realm of music evaluation, similarity
refers to the degree of resemblance between different
musical pieces. It plays a crucial role in various
applications, such as music recommendation systems,
automatic composition, and genre classification.
Music similarity is often analysed based on features
like melody, harmony, rhythm, timbre, and structure.
This section focuses on the different methods used to
quantify musical similarity and their applications.
Melodic similarity is one of the most
fundamental aspects of music comparison. It involves
analysing the sequence of pitches in a melody to
determine how closely two musical pieces align.
Traditional methods for measuring melodic similarity
rely on calculating the Euclidean distance between
pitch sequences. For instance, two melodies with
similar pitch contours would exhibit a shorter
Euclidean distance between their note sequences,
indicating higher similarity. However, this method
does not account for temporal variations or rhythmic
complexities, which can significantly affect the
perception of melodic similarity.
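As a minimal sketch of this baseline, assuming two already aligned, equal-length MIDI pitch sequences (the melodies below are invented examples):

```python
import math

def pitch_euclidean_distance(melody_a, melody_b):
    """Euclidean distance between two aligned MIDI pitch sequences;
    a smaller distance suggests more similar pitch contours. Rhythm
    and timing are ignored entirely."""
    if len(melody_a) != len(melody_b):
        raise ValueError("sequences must be aligned to equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(melody_a, melody_b)))

m1 = [60, 62, 64, 65, 67, 65, 64, 62]   # MIDI note numbers
m2 = [60, 62, 64, 67, 67, 65, 64, 60]
print(pitch_euclidean_distance(m1, m2))  # small value -> high similarity
```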
Advanced models such as N-gram and Hidden
Markov Models (HMM) offer more sophisticated
ways of capturing melodic similarity by considering
not only pitch sequences but also the probability of
transitions between pitches. These models are
particularly useful in tasks like composer
identification or genre classification. By analysing
patterns in melodic transitions, these models can
identify stylistic tendencies unique to certain
composers or genres. For example, N-gram models
can capture recurring melodic motifs that characterize
a composer’s style, while HMM can track how pitch
changes unfold over time, offering a deeper analysis
of melodic structure.
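The following Python sketch illustrates the idea with a bigram (2-gram) model over pitch transitions; the two "style" corpora are toy placeholders rather than data from any cited study:

```python
from collections import Counter, defaultdict
import math

def train_bigram(melodies):
    """Count pitch transitions (current note -> next note) in a corpus
    of MIDI pitch sequences."""
    counts = defaultdict(Counter)
    for mel in melodies:
        for cur, nxt in zip(mel, mel[1:]):
            counts[cur][nxt] += 1
    return counts

def log_likelihood(melody, counts, vocab_size=128):
    """Log-probability of a melody under a bigram model with add-one
    smoothing; higher means the melody is more typical of the model's
    training corpus (e.g., one composer's recurring motifs)."""
    ll = 0.0
    for cur, nxt in zip(melody, melody[1:]):
        total = sum(counts[cur].values())
        ll += math.log((counts[cur][nxt] + 1) / (total + vocab_size))
    return ll

style_a = train_bigram([[60, 62, 64, 62, 60], [67, 65, 64, 62, 60]])
style_b = train_bigram([[60, 64, 67, 72, 67], [72, 71, 72, 74, 72]])
query = [60, 62, 64, 62, 60]
print("A" if log_likelihood(query, style_a) > log_likelihood(query, style_b)
      else "B")  # the query's stepwise motion better matches style A
```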
Harmonic similarity evaluates how closely the
chord progressions or harmonic structures of two
pieces resemble each other. Since harmony plays a
vital role in defining the tonal character of music,
harmonic similarity analysis can reveal relationships
between pieces that might not be apparent through
melodic or rhythmic analysis alone. Techniques for
measuring harmonic similarity often involve
analyzing the intervals between chords and the
progression of these intervals over time. Chord-based
models such as the Tonnetz (tonal network) have
been employed to map harmonic relationships
geometrically, allowing for the comparison of chord
sequences based on their proximity in tonal space. For
example, the Tonnetz model enables the
identification of closely related chord progressions,
such as those found in pieces from the same musical
genre or period (Mor et al., 2021). Some application
examples are shown in Fig. 1.
Harmonic similarity is particularly relevant in
tasks such as genre classification, where certain
harmonic progressions are characteristic of specific
styles. For instance, classical music tends to employ
complex, modulating harmonic structures, while pop
music may rely on simpler, repetitive chord
progressions. Analyzing these differences allows for
a more nuanced understanding of genre distinctions.
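One simple way to operationalize Tonnetz proximity, sketched below under assumed conventions (a lattice whose axes step by fifths and major thirds, with Manhattan distance as the metric), is to embed each chord tone in the lattice and average nearest-neighbour distances between two chords:

```python
def tonnetz_coords(pitch_class):
    """Map a pitch class (0-11) to a lattice point whose axes step by
    perfect fifths (7 semitones) and major thirds (4 semitones); among
    all points realizing the pitch class, pick the one nearest the origin."""
    best = None
    for x in range(-3, 4):          # fifths axis
        for y in range(-3, 4):      # major-thirds axis
            if (7 * x + 4 * y) % 12 == pitch_class % 12:
                if best is None or abs(x) + abs(y) < abs(best[0]) + abs(best[1]):
                    best = (x, y)
    return best

def chord_distance(chord_a, chord_b):
    """Symmetric average nearest-neighbour distance between two chords'
    tones in tonal space; closely related chords score low."""
    pa = [tonnetz_coords(p) for p in chord_a]
    pb = [tonnetz_coords(p) for p in chord_b]
    def nn(p, pts):
        return min(abs(p[0] - q[0]) + abs(p[1] - q[1]) for q in pts)
    return (sum(nn(p, pb) for p in pa) +
            sum(nn(q, pa) for q in pb)) / (len(pa) + len(pb))

C_MAJOR, A_MINOR, FSHARP_MAJOR = [0, 4, 7], [9, 0, 4], [6, 10, 1]
print(chord_distance(C_MAJOR, A_MINOR))       # small: two shared tones
print(chord_distance(C_MAJOR, FSHARP_MAJOR))  # larger: tonally distant
```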
Rhythm is another important factor in
determining music similarity, as it defines the
temporal structure of a piece. Rhythmic similarity can
be evaluated by comparing the timing and duration of
notes, the placement of accents, and the overall flow
of a piece (Yang et al., 2019). One common approach
to evaluating rhythmic similarity is through the use of
beat-synchronous features, which analyse rhythm at
regular time intervals.

Figure 1: Applications of the Hidden Semi-Markov Model in music and other fields (Mor et al., 2021).

Figure 2: Scheme of music rhythm similarity detection based on a Siamese CNN (Park et al., 2022).

This allows for directly comparing rhythmic patterns across different musical
pieces, even if their tempos vary. For example, two
pieces with different tempos but similar rhythmic
patterns may still exhibit high rhythmic similarity.
Syncopation, the displacement of expected rhythmic
accents, is another key element that influences
rhythmic similarity. A typical sketch based on CNN
is shown in Fig. 2 (Park et al., 2022). Genres like jazz
and funk are known for their heavy use of
syncopation, and recognizing these patterns is
essential for accurately assessing the similarity of
pieces within these genres. Together, these
elements—emotion, rhythm, and similarity—form
the foundation of modern music evaluation.
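A minimal sketch of beat-synchronous rhythm comparison follows; the grid resolution, tempi, and cosine measure are illustrative choices rather than a fixed standard from the literature:

```python
import numpy as np

def beat_synchronous_pattern(onsets_sec, tempo_bpm, steps_per_beat=4, n_beats=4):
    """Quantize onset times (in seconds) to a binary grid indexed in
    beats rather than seconds, making the pattern tempo-invariant."""
    beat_len = 60.0 / tempo_bpm
    grid = np.zeros(n_beats * steps_per_beat)
    for t in onsets_sec:
        step = int(round(t / beat_len * steps_per_beat))
        if 0 <= step < len(grid):
            grid[step] = 1.0
    return grid

def rhythmic_similarity(p, q):
    """Cosine similarity between two binary rhythm grids (1.0 = identical)."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# The same quarter-note pattern played at 100 BPM and at 140 BPM.
slow = beat_synchronous_pattern([0.0, 0.6, 1.2, 1.8], tempo_bpm=100)
fast = beat_synchronous_pattern([0.0, 0.43, 0.86, 1.29], tempo_bpm=140)
print(rhythmic_similarity(slow, fast))  # 1.0 despite the tempo difference
```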
3 MODELS AND APPLICATIONS
Previous researchers have used different
mathematical models (e.g., N-gram model, HMM) to
obtain temporal information about melodic features. These time-series models have been shown to make it feasible to study the state-transition behaviour of musical features for melodic classification or similarity computation. For example,
Hillewaert et al. performed composer-classification experiments by extracting melodic timing features from string quartets by Haydn and Mozart with N-gram models, and showed that melody-specific N-gram models outperform global feature models in composer identification tasks. HMMs are also commonly used for melodic similarity computation and melodic categorization. Unlike N-gram models and other temporal models that produce only a single chain of states, an HMM can capture the dependency between two features (represented by the hidden and observed sequences, respectively) while also modelling the temporal transitions of a feature, and it therefore achieves good results in melody classification tasks. For example,
some scholars used HMM to classify classical and
popular music melodies by five different composers
using the relative pitch and duration of the preceding
and following melodic notes as features, and showed
that note timing features are reliable cues for melodic categorization when no timbral texture or harmony is involved, and that their categorization results can be comparable to the level of a music expert. Wei and Vercoe used HMMs to classify folk songs composed in different countries and represented in different ways. Their results show that classifying folk songs with HMMs based on length and pitch features is effective, with classification accuracies ranging from 54% to 77%, a marked improvement over random guessing. The Viterbi algorithm, a widely used dynamic programming method, was originally proposed for the shortest-path problem on directed graphs. Problems described by hidden Markov models can be decoded with the Viterbi algorithm; its main application scenarios include digital communication, speech recognition, machine translation, pinyin-to-Chinese-character conversion, and word segmentation (Suriya & Kiran, 2022). The basic idea of the algorithm is that any optimal path from the starting point to the end point must consist of optimal sub-paths: if some segment were not the shortest, it could be replaced by a shorter one, contradicting the optimality of the whole path.
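For illustration, the following self-contained Python sketch implements Viterbi decoding for a discrete HMM in the log domain; the two-state model and its "low/high pitch" reading are toy assumptions, not one of the systems discussed above:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete HMM.

    obs: observation indices (length T); pi: initial probabilities (N,);
    A: transition matrix (N, N); B: emission matrix (N, M). Working in
    the log domain avoids numerical underflow on long sequences."""
    T = len(obs)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]     # best log-prob ending in each state
    psi = np.zeros((T, len(pi)), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + logA       # scores[i, j]: best path i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # follow back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state model whose states prefer "low" (0) or "high" (1) pitches.
pi = np.array([0.6, 0.4])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1, 0], pi, A, B))  # -> [0, 0, 1, 1, 0]
```

The same dynamic-programming structure underlies the shortest-path intuition described above: each step extends only the best partial path into every state.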
4 LIMITATIONS AND
PROSPECTS
Many studies, including those discussed above, have quantitatively evaluated and categorized melodic similarity with good results. However, most current studies treat melodic pitch and rhythm as relatively independent features when calculating melodic similarity and performing categorization. In recent years, some researchers have argued from the perspective of music theory that melodic pitch and rhythm form an interrelated whole, and that the two exhibit different synchronization relationships in different melodic styles. For example, music-theoretic analysis has shown that melodic pitch and rhythm in Chinese and Western music develop in different synchronous relationships.
Therefore, it can be inferred that quantitative results may differ significantly depending on how pitch and rhythm interact in each musical style (Hu, 2020). However,
there is a lack of research on how pitch-rhythm dependency patterns differ across music styles. Verifying these differences requires a mathematical model that can reflect the dependency relationship between the two features. Future research may benefit
from incorporating advanced machine learning
techniques and multi-modal approaches to enhance
accuracy and depth in music evaluation.
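As one minimal illustration of a model that reflects such dependency, the empirical mutual information between paired, discretized pitch and rhythm categories quantifies how strongly the two features co-vary; the coarse "up/down" and "long/short" encodings below are hypothetical:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Empirical mutual information (bits) between two paired discrete
    sequences; 0 means independent, larger values mean stronger
    pitch-rhythm dependency."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Sign of each pitch interval vs. long/short duration of the same note.
intervals = ["up", "up", "down", "up", "down", "down", "up", "down"]
durations = ["long", "long", "short", "long", "short", "short", "long", "short"]
print(mutual_information(intervals, durations))  # 1.0 bit: fully dependent
```

Comparing such dependency scores across corpora of different styles would be one concrete way to test the hypothesized differences between, for example, Chinese and Western melodies.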
5 CONCLUSIONS
To sum up, this research investigated the metrics and
evaluation methodologies used in music assessment,
with a focus on the quantification of melody, rhythm,
and harmony. This study investigated the
development of computer-based music evaluation
systems, emphasizing the efficacy of models such as
N-gram and HMM in capturing melodic and rhythmic
patterns. The research also examined the application of recent deep learning algorithms to music evaluation, indicating
their expanding importance in automated music
analysis. Despite improvements, existing models still
struggle to reflect the complexities of musical
emotions and cross-cultural variety. To improve the
accuracy and depth of music appraisal, future studies
will most likely include more advanced machine
learning approaches and multimodal analysis. This
study contributes to the ongoing efforts to develop
more objective and comprehensive techniques for
evaluating musical works.
REFERENCES
Cope, D., 1989. Experiments in musical intelligence (EMI): Non-linear linguistic-based composition. Interface, 18(1-2), 117-139.
Cone, E. T., 1981. The Authority of Music Criticism.
Journal of the American Musicological Society, 34(1),
1–18.
Fink, G. A., 2014. n-Gram Models. Markov Models for
Pattern Recognition: From Theory to Applications,
107-127.
Hu, X., 2020. Recommendation algorithm based on
sentiment analysis. Southwest University of Finance
and Economics.
Lin, Q., Qi, Z., 2018. Research on speech emotion
recognition based on mixed HMM and ANN models.
Computer Technology and Development, 10, 74-78.
Mor, B., Garhwal, S., Kumar, A., 2021. A Systematic
Review of Hidden Markov Models and Their
Applications. Archives of Computational Methods in Engineering, 28, 1429-1448.
Park, K., Baek, S., Jeon, J., Jeong, Y. S., 2022. Music
Plagiarism Detection Based on Siamese CNN. Human-centric Computing and Information Sciences, 12, 38.
Salamon, J., Rocha, B., Gómez, E., 2012. Musical genre classification using melody features extracted from polyphonic music signals. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Suriya, P. J., Kiran, S. 2022. Obtain Better Accuracy Using
Music Genre Classification System on GTZAN Dataset.
2022 IEEE North Karnataka Subsection Flagship
International Conference (NKCon), 1-5.
Yang, M., Chen, N., 2018. Cover song recognition model
based on deep learning and manual design feature
fusion. Journal of East China University of Science and
Technology, 5, 752-759.
Yang, Y., Jo, J., Lim, H., 2019. Unifying user preference
and item knowledge-based similarity models for top-N
recommendation. Personal and Ubiquitous Computing,
23(6), 901-912.