EmoMusic and EMOPIA have shown promising
results in generating music that reflects specific
emotions like sadness, happiness, and calmness,
tailored to listeners’ expectations (Zhu et al., 2021).
These advancements demonstrate AI’s increasing
ability to understand and simulate the complexity of
human emotions through music, creating new
possibilities for personalized music creation, music
therapy, interactive digital art forms, and other applications.
This study explores how emotional expression,
particularly sadness, happiness, and calmness, can be
implemented in music creation, and introduces
artificial intelligence approaches to doing so through
specific examples and references. The framework of this study includes a
comprehensive analysis of variables for different
emotions, an evaluation of their effectiveness, and a
discussion of their limitations and potential
improvements. In the following sections, this study
will first outline how to create music with different
emotions by controlling different variables, e.g.,
melody, harmony, tempo, and dynamics. It then describes
how specific emotions are implemented through computational
models, presents representative results and the principles
behind them, and evaluates the outcomes. Finally, this research highlights
the main findings, challenges, and future prospects of
this field.
2 MODELS AND EVALUATIONS
Implementing emotions in AI music creation relies on
advanced models that draw on deep learning,
generative algorithms, and music theory principles.
These models are designed to generate music that
reflects specific emotional states, such as sadness,
happiness, or calmness. This section explores key
models used in emotion-driven music creation, tools
and software that facilitate this process, and methods
used to evaluate the quality and effectiveness of
generated music.
One of the most notable models used in
emotion-based music generation is the Recurrent
Neural Network (RNN), particularly its Long Short-
Term Memory (LSTM) variant. LSTM networks
excel at sequence prediction, making them well
suited to music generation, as they can capture
temporal dependencies in musical compositions
(Briot, Hadjeres, & Pachet, 2020). LSTMs have been
widely used to generate sequences of notes that align
with the emotional tone specified by the input data.
For instance, an LSTM model trained on a dataset of
sad classical music pieces can generate compositions
that simulate the emotional patterns and
characteristics found in the training data, such as
minor keys, slower tempos, and low dynamics.
However, the effectiveness of LSTM-based models
largely depends on the quality and diversity of the
training datasets, as well as the model's architecture
and hyperparameters (Ferreira & Whitehead, 2019).
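To make this concrete, the sketch below conditions a small LSTM note predictor on an emotion label and samples a continuation from a seed phrase. It is a minimal illustration in PyTorch, not one of the cited systems; the vocabulary size, emotion indices, and hyperparameters are assumptions chosen for brevity.

import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    """Minimal LSTM note predictor conditioned on an emotion label (illustrative)."""
    def __init__(self, n_notes=128, n_emotions=3, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.note_embed = nn.Embedding(n_notes, embed_dim)        # assumed MIDI-style pitch vocabulary
        self.emotion_embed = nn.Embedding(n_emotions, embed_dim)  # 0=sad, 1=happy, 2=calm (assumed)
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_notes)

    def forward(self, notes, emotion, state=None):
        # notes: (batch, seq_len) note indices; emotion: (batch,) label indices
        e = self.emotion_embed(emotion).unsqueeze(1).expand(-1, notes.size(1), -1)
        x = torch.cat([self.note_embed(notes), e], dim=-1)        # concatenate note and emotion features
        h, state = self.lstm(x, state)
        return self.out(h), state                                 # logits over the next note at each step

# Sample a short continuation of a seed phrase under the "sad" condition.
model = EmotionLSTM()                                             # untrained here; for illustration only
emotion = torch.tensor([0])
generated, state = [60, 58, 57], None                             # seed pitches (MIDI numbers)
for _ in range(16):
    inp = torch.tensor([[generated[-1]]])
    logits, state = model(inp, emotion, state)
    probs = torch.softmax(logits[0, -1], dim=-1)
    generated.append(torch.multinomial(probs, 1).item())

In practice, such a model would first be trained on emotion-labeled note sequences (for example, minor-key, slow pieces for the "sad" label), so that sampling reproduces the emotional characteristics described above.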
Another model that has gained popularity for its
ability to generate emotionally rich music is the
Generative Adversarial Network (GAN). GANs
consist of two neural networks, a generator and a
discriminator, trained simultaneously in a
competitive process. In the context of music
generation, the generator produces music samples
based on the emotion of the input, while the
discriminator evaluates the authenticity and
emotional consistency of these samples based on real
music data (Yang et al., 2017). Variants of GANs,
such as Conditional GANs (cGANs), have been used
to generate music conditioned on specific emotional labels,
yielding more targeted outputs. The advantage of
using GANs is their ability to learn complex
distributions and generate diverse musical
compositions. However, training GANs is
computationally intensive and requires careful tuning
to avoid common pitfalls such as mode collapse
(Herremans et al., 2020).
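The sketch below illustrates the conditional setup described above: both the generator and the discriminator receive an emotion label, and the two are updated through opposing losses. It is a deliberately simplified, assumption-level example in PyTorch (dense layers over a flattened piano-roll of assumed size), not a reconstruction of the cited GAN architectures.

import torch
import torch.nn as nn

N_EMOTIONS, NOISE_DIM = 3, 100
ROLL = 128 * 32                       # assumed piano-roll size: 128 pitches x 32 time steps, flattened

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_embed = nn.Embedding(N_EMOTIONS, 16)
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + 16, 512), nn.ReLU(),
            nn.Linear(512, ROLL), nn.Sigmoid())       # note-on probabilities per piano-roll cell

    def forward(self, z, emotion):
        return self.net(torch.cat([z, self.label_embed(emotion)], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_embed = nn.Embedding(N_EMOTIONS, 16)
        self.net = nn.Sequential(
            nn.Linear(ROLL + 16, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1))                        # real/fake logit

    def forward(self, roll, emotion):
        return self.net(torch.cat([roll, self.label_embed(emotion)], dim=1))

# One adversarial step against a batch labeled "sad" (index 0); real data is random here.
G, D = Generator(), Discriminator()
bce = nn.BCEWithLogitsLoss()
real, labels = torch.rand(8, ROLL), torch.zeros(8, dtype=torch.long)
fake = G(torch.randn(8, NOISE_DIM), labels)
d_loss = bce(D(real, labels), torch.ones(8, 1)) + bce(D(fake.detach(), labels), torch.zeros(8, 1))
g_loss = bce(D(fake, labels), torch.ones(8, 1))       # generator is rewarded for fooling D

In a full training loop, the two losses would be back-propagated alternately with separate optimizers, and the care this alternation requires is one source of the instability, such as mode collapse, noted above.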
Transformer-based models have also been used
for music generation tasks due to their powerful
sequence modeling capabilities. The Transformer
architecture has achieved great success in natural
language processing (NLP). It has been adapted for
music generation by representing musical elements as
sequences similar to words in a sentence. Models
such as the Music Transformer and GPT-based
architectures (e.g., OpenAI’s MuseNet) have
effectively captured long-term dependencies and
complex structures in music, enabling the generation
of compositions that evoke specific emotions (Huang
et al., 2018). Transformers can be fine-tuned on
emotion-labeled datasets to align the generated music
with the desired emotional expression. This
approach has been shown to successfully generate
coherent and expressive music in various genres and
emotional contexts.
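One common way to realize such conditioning, sketched below under assumed token and vocabulary conventions, is to prepend an emotion control token to each event sequence and fine-tune the model with the standard next-token objective; the tiny decoder-style Transformer here is purely illustrative and not the Music Transformer or MuseNet architecture.

import torch
import torch.nn as nn

VOCAB, D_MODEL, SEQ = 512, 256, 64                    # assumed event vocabulary and sequence length
EMO_TOKENS = {"sad": 509, "happy": 510, "calm": 511}  # assumed control-token ids inside the vocabulary

class TinyMusicTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Parameter(torch.zeros(1, SEQ, D_MODEL))      # learned positional encoding
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)  # acts as a decoder via the causal mask
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        sz = tokens.size(1)
        causal = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
        x = self.embed(tokens) + self.pos[:, :sz]
        return self.out(self.decoder(x, mask=causal))              # next-token logits

# Fine-tuning step: prepend the "sad" control token to (random stand-in) event streams.
model = TinyMusicTransformer()
events = torch.randint(0, 509, (4, SEQ - 1))                       # batch of musical event ids
tokens = torch.cat([torch.full((4, 1), EMO_TOKENS["sad"]), events], dim=1)
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()                                                    # one gradient step of fine-tuning

At generation time, the same control token would be supplied as the first input so that sampling proceeds under the chosen emotional condition.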
To evaluate the quality and emotional accuracy of
AI-generated music, researchers have employed both
quantitative and qualitative methods. Quantitative
methods typically involve metrics such as note
density, pitch range, and rhythmic complexity, which
can be statistically analyzed to determine how well
the generated music matches specific emotional
profiles (Liu et al., 2021). For example, music
classified as “sad” may exhibit a lower average tempo
and use more minor chords than “happy” music.
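The sketch below shows how such a quantitative profile might be computed for a piece represented as simple (pitch, start, end) note tuples; the representation, the notes-per-second density measure, and the crude minor-versus-major third count are assumptions for illustration rather than the metrics used in the cited studies.

from collections import Counter

def emotion_profile(notes):
    """notes: list of (midi_pitch, start_sec, end_sec) tuples -- an assumed representation."""
    duration = max(end for _, _, end in notes) - min(start for _, start, _ in notes)
    pitches = [p for p, _, _ in notes]
    density = len(notes) / duration                   # note density in notes per second
    pitch_range = max(pitches) - min(pitches)         # in semitones
    # Crude tonal cue: compare how often a minor vs. a major third lies above each pitch class.
    pcs = Counter(p % 12 for p in pitches)
    minor = sum(pcs[pc] for pc in pcs if (pc + 3) % 12 in pcs)
    major = sum(pcs[pc] for pc in pcs if (pc + 4) % 12 in pcs)
    return {"note_density": density,
            "pitch_range": pitch_range,
            "minor_third_bias": minor / max(1, minor + major)}

# Example: a short, sparse, low-register fragment of the kind a "sad" profile might favor.
sad_fragment = [(57, 0.0, 1.0), (60, 1.0, 2.2), (64, 2.2, 3.5), (57, 3.5, 5.0)]
print(emotion_profile(sad_fragment))

A corpus-level comparison would then test, for example, whether pieces generated under the "sad" condition show lower note density and a stronger minor-third bias than those generated under the "happy" condition.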