Analysis and Comparison of Algorithmic Composition Using
Transformer-Based Models
Shaozhi Pi
Daniel J. Epstein Department of Industrial & Systems Engineering,
University of Southern California, Los Angeles, U.S.A.
Keywords: Transformer, Music Generation, Multitrack Composition, Generative AI.
Abstract: As a matter of fact, transformers have revolutionized generative music by overcoming the limitations of earlier
models (e.g., RNNs) in recent years, which struggled with long-term dependencies. With this in mind, this
paper explores and compares four transformer-based models, i.e., Transformer-VAE, Multitrack Music
Transformer, MuseGAN as well as Pop Music Transformer. To be specific, the Transformer-VAE offers
hierarchical control for generating coherent long-term compositions. In addition, the Multitrack Music
Transformer excels in real-time multitrack music generation with efficient memory use. At the same time,
MuseGAN supports human-AI collaboration by generating multitrack music based on user input, while Pop
Music Transformer focuses on rhythmic and harmonic structures, making it ideal for pop genres. According
to the analysis, despite their strengths, these models face computational complexity, limited genre adaptability,
and synchronization issues. Prospective advancements, including reinforcement learning and multimodal
integration, are expected to enhance creative flexibility and emotional expressiveness in AI-generated music.
1 INTRODUCTION
Generative music, rooted in the early work of
algorithmic composition, has seen considerable
growth over the past few decades. Initially,
techniques such as rule-based systems dominated the
field (Ames, 1987; Ames, 1989). These methods were
innovative but limited in capturing human-composed
music's complexity, emotional depth, and creativity.
However, the advent of machine learning, and more
recently, deep learning, has significantly advanced
the capabilities of algorithmic composition.
Early deep neural network approaches to
generative music, such as recurrent neural networks,
focused on modeling the sequential nature of music.
While effective at capturing short-term dependencies,
these models struggled with maintaining long-term
coherence, a crucial aspect of human creativity in
music. The introduction of the Transformer model by
Vaswani et al., initially developed for natural
language processing, revolutionized the ability of AI
to generate music with complex structures (Vaswani
et al., 2017). With its self-attention mechanism, this
model allowed for preserving long-term
dependencies in sequential data, making it ideal for
symbolic music generation. One of the most
significant breakthroughs in this domain was the
Music Transformer (Huang et al., 2018), which
utilized self-attention to maintain coherence in music
generation across extended sequences. This
innovation marked a turning point in generating
music that exhibits structural integrity and creativity
and approaches human standards. Based on this
foundation, several models have been developed to
enhance specific aspects of music generation, such as
multitrack music and hierarchical structures.
The Multitrack Music Transformer, introduced by
Dong et al., addressed the challenges of generating
multitrack compositions (Dong et al., 2023). These
challenges are particularly relevant for understanding
the interplay of creativity across different musical
elements, such as melody, harmony, and rhythm.
However, while this model made significant strides in
generating coherent music across multiple tracks,
capturing the interdependencies between these
elements remains challenging, especially in
generating emotionally engaging and structurally
complex music. Another significant contribution
came from MuseGAN, a model that utilizes
Generative Adversarial Networks (GAN) to generate
multitrack music (Dong et al., 2018). Unlike
transformer-based models, which focus on attention
mechanisms, GANs introduce a competitive learning
process that can refine the generated music through
adversarial training. MuseGAN demonstrated how AI
could collaboratively generate music alongside
human input, with separate generators for each
musical track. Repetition, structure, and emotional
expressiveness are central to the human perception of
creativity in music. Dai et al. noted that while deep
learning models can generate musically interesting
sequences, they often struggle to replicate the
nuanced repetition and variation found in human-
composed music (Dai et al., 2022). The Transformer
Variational Auto-Encoder (Transformer-VAE) was
introduced to address this limitation, combining the
strengths of transformer architectures with the
probabilistic modeling capabilities of VAE to
generate music that is both structurally coherent and
emotionally expressive (Jiang et al., 2020).
As generative music continues to evolve, new
models such as Jukebox by OpenAI and MusicLM by
Google Research have further pushed the boundaries
of AI's creative potential. Jukebox generates raw
audio and is capable of vocal synthesis across various
genres (Dhariwal et al., 2020). In contrast, MusicLM
generates music from text descriptions, combining
pre-trained models to achieve stylistic and emotional
diversity (Agostinelli et al., 2023). These advancements
illustrate the rapid progress in generating music that
adheres to specific creative intentions, opening new
avenues for technical improvements and
philosophical considerations of creativity in machine-
generated music.
Despite the significant breakthroughs in music
generation, challenges persist. Current models often
struggle to fully capture the intricacies of musical
creativity, particularly in how they integrate complex
musical elements like melody, harmony, and rhythm.
This underscores the need for continuous
improvement and innovation in the field. Future work
will likely focus on enhancing transformer-based
models to capture these interactions more effectively,
potentially redefining the understanding of creativity
in music generation.
This paper evaluates the state-of-the-art
transformer-based models for music generation,
focusing on their strengths, limitations, and
contributions to the broader understanding of AI-
driven creativity. The paper aims to comprehensively
analyze these models, including the Transformer-
VAE, Multitrack Music Transformer, MuseGAN,
and Pop Music Transformer. It examines their
approach to multitrack generation, structural
coherence, emotional expressiveness, and technical
efficiency. The goal is to offer insights into how these
models have shaped the landscape of algorithmic
composition and to suggest future developments that
might enhance the creative capabilities of generative
models.
2 DESCRIPTIONS OF MUSIC
COMPOSING MODELS
Music generation using transformer models has
gained significant attention in recent years due to the
transformer's ability to model long-range
dependencies within sequential data. Unlike earlier
models, such as recurrent neural networks (RNNs) and
convolutional neural networks (CNNs), which face
challenges in capturing long-term dependencies and
contextual relationships, transformers utilize self-
attention mechanisms to process entire sequences
simultaneously, making them particularly effective
for music generation. The fundamental principle
behind transformer models is the self-attention
mechanism (Vaswani et al., 2017), which allows each
element in a sequence (in this case, a musical note or
event) to attend to every other element, regardless of
its position in the sequence. This ability to capture
dependencies across a range of time steps enables
transformers to maintain coherence over extended
periods, a critical feature in generating complex
musical structures. The model's architecture typically
consists of multiple layers of attention heads, each
focusing on different aspects of the sequence, such as
rhythmic patterns or harmonic progressions. This
multi-layered approach ensures that the model
captures local musical relationships and global
structural patterns.
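To make this mechanism concrete, the following minimal sketch computes single-head scaled dot-product self-attention over a sequence of embedded musical events. The projection matrices and the 16-event, 32-dimensional sequence are illustrative placeholders, not parameters from any of the surveyed models.

```python
# Minimal sketch of scaled dot-product self-attention over a sequence of
# embedded musical events, using NumPy only.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (seq_len, d_model) sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project tokens to queries, keys, values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # every event attends to every other event
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence dimension
    return weights @ v                               # weighted mixture of value vectors

# Example: 16 musical events embedded in a 32-dimensional space.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))
w_q, w_k, w_v = (rng.normal(size=(32, 32)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # shape (16, 32)
```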
Symbolic music is often represented in tokenized
form to apply transformers to music generation. Each
token may represent a note's pitch, duration, velocity,
or other musical attributes, such as instrument or
articulation. This tokenization allows the transformer
to process music similarly to text sequences, treating
each note or event as a word in a sentence. For
example, in the REMI (revamped MIDI-
derived events) format, music is broken down into
time events, position events, pitch events, and other
performance-based tokens, which the transformer
processes to generate a coherent musical output
(Huang et al., 2020).
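As an illustration of this tokenization idea, the sketch below converts a single note into a short REMI-style token list. The token names and quantization steps are assumptions made for clarity and do not reproduce the exact vocabulary of Huang and Yang (2020).

```python
# Illustrative REMI-style tokenization of one note; token names and
# quantization choices are assumptions, not the published vocabulary.
from dataclasses import dataclass

@dataclass
class Note:
    bar: int        # bar index within the piece
    position: int   # onset position within the bar, e.g. in 16th-note steps
    pitch: int      # MIDI pitch number (0-127)
    duration: int   # length in 16th-note steps
    velocity: int   # MIDI velocity (0-127)

def note_to_tokens(note, prev_bar=None):
    """Convert one note into string tokens, emitting a Bar token at each new bar."""
    tokens = []
    if note.bar != prev_bar:
        tokens.append("Bar")
    tokens += [
        f"Position_{note.position}",
        f"Pitch_{note.pitch}",
        f"Duration_{note.duration}",
        f"Velocity_{note.velocity // 4}",  # coarse velocity bins
    ]
    return tokens

print(note_to_tokens(Note(bar=0, position=0, pitch=60, duration=4, velocity=96)))
# ['Bar', 'Position_0', 'Pitch_60', 'Duration_4', 'Velocity_24']
```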
The training process for transformer models in
music generation typically involves feeding large
datasets of symbolic music, such as MIDI files, into
the model. These datasets may represent various
genres, styles, and compositional forms, allowing the
transformer to learn diverse musical structures. By
learning the relationships between different musical
elements, the transformer can generate new
compositions that reflect the micro-level details (e.g.,
note transitions, dynamics) and macro-level
structures (e.g., harmonic progressions, form). The
self-attention mechanism is particularly well-suited
for capturing these layers of detail because it enables
the model to focus on both close and distant
relationships within the music, such as how a chord
progression develops over several bars or how
rhythmic patterns evolve throughout a piece.
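A minimal sketch of this training setup is given below, assuming the MIDI data have already been tokenized into integer id sequences and that `model` is any PyTorch module mapping token ids to next-token logits; it shows only the shifted next-token cross-entropy objective the text describes.

```python
# Sketch of the training objective: next-token prediction with cross-entropy
# over tokenized MIDI sequences. `model` and the data pipeline are assumed.
import torch
import torch.nn as nn

def training_step(model, batch, optimizer):
    """batch: LongTensor of shape (batch_size, seq_len) holding token ids."""
    inputs, targets = batch[:, :-1], batch[:, 1:]      # shift targets by one position
    logits = model(inputs)                             # (batch, seq_len-1, vocab_size)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```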
Furthermore, transformer models use positional
encoding to maintain information about the order of
musical events, which is critical for generating
coherent musical sequences. Since the transformer
architecture does not inherently understand the
sequential nature of time, positional encodings are
added to each input token, enabling the model to
discern the temporal structure of the music. These
encodings allow the transformer to generate music
that makes sense note-to-note and maintains
structural integrity across longer sequences.
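The sinusoidal positional encoding of Vaswani et al. (2017), sketched below in NumPy, illustrates how each event index receives a fixed position vector that is added to its token embedding; the sequence length and model dimension are arbitrary example values.

```python
# Sinusoidal positional encoding: each time step (event index) gets a fixed
# position vector that is added to its token embedding.
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                            # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                         # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dimensions
    return pe

pe = positional_encoding(seq_len=64, d_model=128)  # added to the token embeddings
```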
Another crucial aspect of music generation using
transformers is autoregressive modeling. In this
approach, the model generates one token at a time,
predicting the next token based on the previous ones.
This method ensures that the generated music remains
contextually consistent, as each generated note or
event is conditioned on the preceding sequence. In
some cases, beam search or top-k sampling strategies
ensure that the model generates more diverse and
musically interesting outputs rather than simply
predicting the most likely following note at each step.
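The following sketch shows one way top-k sampling can be implemented for the autoregressive decoding step described above; the vocabulary size and temperature are illustrative, and real systems typically operate on batched tensors rather than a single logit vector.

```python
# Top-k sampling: keep only the k most probable next tokens, renormalize,
# and sample from that restricted distribution.
import numpy as np

def top_k_sample(logits, k=8, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    top_indices = np.argsort(logits)[-k:]              # indices of the k largest logits
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_indices, p=probs))       # sampled token id

next_token = top_k_sample(np.random.randn(512), k=8)   # 512 = hypothetical vocab size
```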
Overall, transformers have demonstrated
remarkable success in generating music exhibiting
local coherence (e.g., smooth transitions between
notes) and global structure (e.g., adherence to musical
form). These models can be further enhanced by
integrating additional deep learning techniques, such
as VAE and GAN, which help to capture even more
nuanced aspects of musical creativity, including
variability and expressiveness. As a result,
transformer-based models are now at the forefront of
computational music generation, combining
advanced deep-learning techniques with the
flexibility of symbolic music representation. These
models have enabled the creation of music that not
only mimics traditional compositional forms but also
explores new creative possibilities that challenge
conventional boundaries.
In addition to the core transformer architecture,
recent advancements have introduced hierarchical
models that enhance the generation of complex
musical structures. For example, by utilizing local
(measure-level) and global (phrase or section-level)
representations, these models allow the transformer to
understand the relationships between different
composition parts better. This approach helps
generate music that maintains thematic development
and variation across extended periods, ensuring that
compositions are more than just a series of disjointed
musical events.
Lastly, transformer models' adaptability to
various music styles and forms is another significant
benefit. Whether generating classical symphonies,
modern pop music, or experimental electronic
compositions, transformers can be fine-tuned to learn
the nuances of different genres. This flexibility is
crucial for applications where creative diversity and
stylistic fidelity are essential, such as music
production, game soundtracks, and collaborative AI-
driven composition tools.
In summary, transformer models in music
composition have opened up new possibilities for
generating technically proficient and creatively
expressive music. By leveraging the self-attention
mechanism, positional encoding, autoregressive
prediction, and hierarchical structuring, these models
can produce music that exhibits detailed intricacies
and larger-scale coherence, making them powerful
tools for advancing AI-driven music composition.
3 REALIZATION OF
ALGORITHMS
This section will explore the technical realization of
four major transformer-based models used for music
generation: the Transformer-VAE, Multitrack Music
Transformer, MuseGAN, and the Pop Music
Transformer.
3.1 Transformer-VAE
The Transformer-VAE aims to combine the
advantages of VAEs and transformers to generate
music that is not only structurally coherent but also
interpretable and flexible in its latent space. The core
idea is to use a hierarchical model that first encodes
the local structure of music (e.g., measures) and then
applies transformer layers to capture global
dependencies.
In the Transformer-VAE model, music is divided
into bars, and each bar is encoded into a latent
representation during the Input Encoding stage. This
is achieved by using a local encoder to capture the
essential musical features of each bar. The Global
Representation is generated by passing these bar-
level latent representations through the transformer
encoder. The encoder applies masked self-attention,
allowing the model to understand and capture the
relationships and dependencies between different
bars, thus creating a coherent global structure. During
the Latent Space Sampling stage, a latent code is
generated for each bar based on mean and variance
estimations provided by the VAE, which adds
variability and creative flexibility to the music
generation process. For the Music Reconstruction
phase, the transformer decoder, conditioned on
previously generated bars and their corresponding
latent variables, reconstructs the full music sequence,
ensuring continuity and coherence throughout the
composition. Finally, the model allows for Context
Transfer, enabling users to modify specific portions
of the generated music while maintaining the overall
structural integrity, offering creative control and
flexibility. A flow chart is given in Fig. 1.
Figure 1: Algorithmic flow for Transformer-VAE
(Photo/Picture credit: Original).
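To illustrate the latent space sampling stage, the sketch below maps a bar-level encoding to a mean and log-variance and draws a latent code with the standard VAE reparameterization trick; the module name and dimensions are hypothetical and are not taken from Jiang et al. (2020).

```python
# Sketch of latent-space sampling for one bar: encode to mean/log-variance,
# reparameterize, and compute the KL regularizer toward a unit Gaussian.
import torch
import torch.nn as nn

class BarLatent(nn.Module):
    def __init__(self, bar_dim=256, latent_dim=64):
        super().__init__()
        self.to_mu = nn.Linear(bar_dim, latent_dim)
        self.to_logvar = nn.Linear(bar_dim, latent_dim)

    def forward(self, bar_encoding):
        mu = self.to_mu(bar_encoding)
        logvar = self.to_logvar(bar_encoding)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # reparameterization trick
        # KL term that keeps the latent space close to a standard normal
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

z, kl = BarLatent()(torch.randn(8, 256))       # 8 bar encodings -> 8 latent codes
```

The decoder would then condition on these bar-level codes (and on previously generated bars) when reconstructing the full sequence, which is what enables the context-transfer behavior described above.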
3.2 Multitrack Music Transformer
The Multitrack Music Transformer is designed to
handle complex multitrack compositions while
optimizing memory usage. It employs a decoder-only
transformer architecture, using multidimensional
input/output spaces to process each track separately.
Due to its efficient architecture, this model excels in
scenarios where real-time or near-real-time music
generation is required.
In the Multitrack Music Transformer, each music
event is represented as a tuple, which includes
attributes such as note type, pitch, duration,
instrument, beat, and position. This comprehensive
Data Representation allows the model to capture all
relevant aspects of a musical event across multiple
tracks. During the Sequence Encoding stage, these
tuples are fed into the transformer decoder, where the
self-attention layers process the sequences. The self-
attention mechanism helps the model understand the
relationships between musical events across different
tracks. The transformer uses an Event Prediction
mechanism for each event, following an
autoregressive approach to predict the next event
based on the sequence of prior events. This ensures
that the generated music evolves logically over time.
Lastly, Multitrack Coordination is essential for
maintaining harmonic and rhythmic dependencies
between instruments. The model ensures that the
tracks are generated and coordinated, respecting the
relationships between instruments to create a
coherent multitrack composition. A flow chart is
shown in Fig. 2.
Figure 2: Algorithmic flow for Multitrack Music
Transformer (Photo/Picture credit: Original).
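The tuple-based data representation described above can be pictured with the following illustrative sketch; the field names, type codes, and instrument ids are assumptions made for exposition rather than the exact encoding used by Dong et al. (2023).

```python
# Illustrative multitrack event tuple: each event carries type, timing,
# pitch, duration, and instrument fields.
from typing import NamedTuple

class MusicEvent(NamedTuple):
    event_type: int   # e.g. 0 = start-of-song, 1 = note, 2 = end-of-song
    beat: int         # beat index from the start of the piece
    position: int     # subdivision within the beat
    pitch: int        # MIDI pitch number
    duration: int     # duration in subdivisions
    instrument: int   # program/instrument id

# A short two-track fragment: piano (0) and bass (33) sounding on the same beat.
events = [
    MusicEvent(1, beat=0, position=0, pitch=60, duration=4, instrument=0),
    MusicEvent(1, beat=0, position=0, pitch=36, duration=8, instrument=33),
]
# Each field gets its own embedding and its own output head, so the decoder
# can predict all fields of the next event in parallel.
```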
3.3 MuseGAN
MuseGAN employs Generative Adversarial
Networks (GAN) to generate multitrack music,
focusing on separate generators for each track. This
model allows for both fully automatic generation and
human-AI cooperative composition, where a human
provides one or more tracks, and the model generates
the accompanying tracks.
In MuseGAN, each track, such as bass, drums,
and piano, is generated using Track-wise Generators,
where separate GANs are trained for each track. Each
generator creates a piano roll for its respective track,
starting from a random noise vector input. The
generated tracks are evaluated by a Discriminator,
which determines whether the music generated by the
model is real or fake, improving the model's ability to
produce authentic-sounding music. During the
Training Process, the generator continuously attempts
to fool the discriminator, while the discriminator
becomes more adept at distinguishing between
generated and real music. To maintain Multitrack
Synchronization, a shared latent vector ensures that
the tracks created by different GANs are harmonized,
preserving harmonic and rhythmic relationships
across all tracks. Additionally, Track Conditioning
allows for human-AI collaboration, where a user can
input a specific track (e.g., a melody or bassline), and
the model will generate the remaining tracks,
conditioned on the provided input to ensure
coherence and creative alignment. A flow chart is
given in Fig. 3.
Figure 3: Algorithmic flow for MuseGAN (Photo/Picture
credit: Original).
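A heavily simplified sketch of the adversarial training step is shown below: per-track generators share a latent vector so their piano rolls stay synchronized, and a single discriminator scores the stacked result. The networks, optimizers, and the non-saturating GAN loss used here are placeholders rather than the exact objective of Dong et al. (2018).

```python
# Simplified adversarial training step with a shared latent vector across
# per-track generators; network definitions are assumed placeholders.
import torch
import torch.nn as nn

def gan_step(generators, discriminator, g_opt, d_opt, real_rolls, latent_dim=128):
    batch = real_rolls.size(0)
    shared_z = torch.randn(batch, latent_dim)                # shared across all tracks
    fake_rolls = torch.stack([g(shared_z) for g in generators], dim=1)

    # Discriminator update: real rolls should score high, generated rolls low.
    d_loss = (nn.functional.softplus(-discriminator(real_rolls)).mean()
              + nn.functional.softplus(discriminator(fake_rolls.detach())).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: try to make the discriminator score generated rolls high.
    g_loss = nn.functional.softplus(-discriminator(fake_rolls)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

Track conditioning can be added to such a loop by concatenating a user-provided track to the generator inputs, which is how the human-AI collaboration scenario described above fits into the same training scheme.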
3.4 Pop Music Transformer
The Pop Music Transformer is designed specifically to
generate pop music, with a focus on rhythm and
harmonic structure. It leverages rhythmic features
within the transformer's architecture, producing
music that follows typical pop structures, including
verses, choruses, and bridges.
In the Pop Music Transformer, the Input
Representation involves tokenizing the music into
rhythmic and harmonic events, emphasizing beat and
meter to capture the rhythmic structure central to pop
music. The model uses Positional Encoding to ensure
that the temporal relationships between events are
respected, allowing it to maintain the characteristic
structure of pop compositions over time. During the
Autoregressive Generation phase, the model
generates one event at a time, predicting the next
event based on the sequence of previously generated
events, ensuring coherence and logical progression.
To enhance the diversity and musicality of the output,
Sampling Methods such as top-k or beam search are
applied during generation, allowing the model to
explore multiple creative pathways while staying
within the bounds of the musical style. A flow chart
is given in Fig. 4.
Figure 4: Algorithmic flow for Pop Music Transformer
(Photo/Picture credit: Original).
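Putting the pieces together, the sketch below outlines an autoregressive generation loop of the kind described above: it repeatedly asks a hypothetical `model` for next-token logits, samples with the top-k sampler sketched in Section 2, and stops after the requested number of bar tokens. All names are illustrative assumptions.

```python
# Autoregressive generation loop over beat-based tokens; `model`, `vocab`,
# and `top_k_sample` (sketched in Section 2) are assumed to exist.
def generate(model, vocab, prompt_tokens, num_bars=8, k=8):
    """Generate tokens until `num_bars` bar boundaries have been produced."""
    tokens = list(prompt_tokens)
    bars_seen = 0
    while bars_seen < num_bars and len(tokens) < 2048:   # hard length cap
        logits = model(tokens)                           # scores over the vocabulary
        next_id = top_k_sample(logits, k=k)              # top-k sampler from Section 2
        tokens.append(next_id)
        if vocab[next_id] == "Bar":                      # a new bar has started
            bars_seen += 1
    return tokens
```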
4 COMPARISON OF MODELS
This section compares the four transformer-based
music generation models, i.e., the Transformer-VAE,
the Multitrack Music Transformer, MuseGAN, and
the Pop Music Transformer, taking into account
their functionality, computational efficiency, real-
world use cases, and creative contributions, both in
fully autonomous music generation and in
collaborative human-AI processes. Each model
excels in specific areas, making them more suitable
for different applications depending on the user's
needs.
4.1 Functionality
The Transformer-VAE is specifically designed to
generate long-term musical structures with flexibility
and creative control. Its hierarchical model allows for
both local and global structure understanding, making
it ideal for compositions that must maintain thematic
consistency across an extended sequence. Moreover,
its primary strength lies in providing interpretability
and context transfer, allowing users to modify
portions of the music and observe how it adapts.
The Multitrack Music Transformer generates
complex, multitrack compositions, such as orchestral
music, where multiple instruments must harmonize.
Its multidimensional input/output handling allows for
faster generation without sacrificing the coherence of
individual tracks. Due to its efficient memory and
computational use, Multitrack Music Transformer is
highly suited for real-time applications like automatic
accompaniment and live performances.
MuseGAN, on the other hand, is well-known for
its ability to assist in both fully automated and
collaborative music creation. With its track-wise
generators and adversarial training, MuseGAN
enables the generation of multitrack music, such as
jazz bands or rock ensembles, while allowing a
human composer to control or input a specific track
for accompaniment generation. This model stands out
for its ability to create coherent yet distinct tracks
while maintaining a harmonic balance between them.
Lastly, the Pop Music Transformer is highly
optimized for generating structured, rhythmically
focused compositions typical of the pop genre. It
balances autonomous creativity and rhythmic control,
making it an excellent tool for producers aiming to
generate rhythmic loops or harmonic progressions
with specific stylistic characteristics. The model's
reliance on rhythmic structure ensures it stays within
genre-specific rules while allowing creative
flexibility.
4.2 Computational Efficiency
The Multitrack Music Transformer is the most
optimized for computational efficiency, given its
design to reduce sequence length and memory
consumption. It offers faster inference times,
particularly suitable for real-time or near-real-time
applications. MuseGAN and Transformer-VAE,
while less efficient, offer other strengths in the
creative control and adaptability they afford the user.
The Pop Music Transformer sits in the middle in
terms of computational needs, as it balances structure
and efficiency.
4.3 Creativity (Autonomous and
Human-AI Collaboration)
Creativity is a vital aspect when comparing these
models. The Transformer-VAE allows for high
creativity through its hierarchical structure, enabling
context transfer and offering deep insight into the
relationship between different composition parts. It
encourages more exploratory forms of creativity,
especially for users interested in modifying segments
of generated music while maintaining overall
coherence.
Multitrack Music Transformer is more focused on
maintaining structure and coordination between
multiple tracks, which limits its exploratory creativity
but makes it highly effective in settings where
harmonic and rhythmic consistency across tracks is
critical. This makes it ideal for orchestrations or
ensemble pieces where each part must fit together
seamlessly. MuseGAN stands out for its ability to
blend human and machine creativity. It is explicitly
designed to assist in human-AI collaboration by
allowing the user to input one or more tracks while
the model generates the rest. This provides a flexible,
creative process in which human intuition and
machine-generated content can blend harmoniously.
While primarily focused on rhythm and harmony, the
Pop Music Transformer offers creative flexibility
within the bounds of its genre. It encourages
creativity in producing rhythmically structured
compositions but is less adaptable to more
exploratory forms of music generation.
4.4 Additional Metrics: User-
Friendliness and Adaptability
One essential aspect not covered by computational
metrics is each model's user-friendliness and
adaptability. The Transformer-VAE and MuseGAN
offer a higher level of user control, making them
suitable for composers who want to interact directly
with the generated content. In contrast, the Multitrack
Music Transformer and Pop Music Transformer are
more automated, offering less user input but greater
efficiency in generating ready-to-use music.
Table 1: Comparison of models.

Model | Functionality | Computational Efficiency | Creative Control | Best Use Case | Human-AI Collaboration
Transformer-VAE | Hierarchical music generation | Moderate | High: context transfer and structure control | Long-term thematic compositions | Low
Multitrack Music Transformer | Multitrack and complex orchestration | High | Medium: focus on track coordination | Orchestral, real-time music | Low
MuseGAN | Multitrack GAN-based generation | Low | High: user inputs one track, model generates the rest | Jazz, rock, human-AI collaboration | High
Pop Music Transformer | Rhythm and harmonic generation | Moderate | Medium: rhythmic focus and pop structure | Pop music, rhythmic loops | Low
4.5 Comparison Summary
In conclusion, the comparative analysis shows that
each model has strengths and weaknesses depending
on the user's specific requirements. MuseGAN is
particularly useful for collaborative music generation,
while Multitrack Music Transformer shines in
multitrack composition for complex music.
Transformer-VAE offers the most creative flexibility,
especially for users interested in structural control,
and Pop Music Transformer excels at generating
rhythmically driven compositions with genre-specific
constraints. The summaries are given in Table 1.
5 LIMITATIONS AND
PROSPECTS
Despite the significant advancements in transformer-
based models for music generation, several
limitations hinder their full potential. These
limitations exist both at the algorithmic level and in
terms of real-world applicability, and addressing
them will be critical for the future evolution of AI-
generated music. One of the Transformer-VAE
model's primary limitations is its complexity and
computational demands. While the hierarchical
structure allows for greater control and flexibility in
generating long-term thematic compositions, it can be
computationally expensive. Balancing local and
global structures through multiple layers of encoders
and decoders increases training time and requires
considerable memory. Additionally, the model's
reliance on latent space sampling may lead to a loss
of fine-grained detail in musical generation, as it
simplifies complex musical elements into latent
variables that can sometimes lose nuance.
While the Multitrack Music Transformer's
efficiency in handling multitrack compositions is a
strength, it struggles with interdependencies between
tracks when complexity increases, such as in
orchestral compositions with numerous instruments.
Moreover, its focus on memory optimization and
faster inference times can come at the cost of creative
flexibility. It excels at maintaining coherence across tracks but
is limited in exploring more innovative, experimental
music.
MuseGAN, although strong in collaborative
human-AI interaction, faces challenges in model
evaluation. The adversarial training inherent in GANs
often suffers from issues such as mode collapse,
where the generator produces repetitive outputs with
limited diversity. Furthermore, because the model
relies on separate generators for each track, it can
sometimes fail to fully synchronize across tracks,
resulting in minor harmonic or rhythmic dissonances.
The Pop Music Transformer, explicitly designed
for generating rhythmic and harmonic pop music, can
limit genre diversity. It excels in structured,
predictable genres like pop but lacks the adaptability
required for more complex or experimental
compositions. Additionally, while it generates
coherent music, its reliance on predefined rhythmic
structures may restrict creative freedom, making it
less useful for users seeking to push the boundaries of
conventional music.
Looking ahead, several key areas for
improvement can enhance these models' capabilities.
One potential direction is integrating reinforcement
learning to encourage greater musical diversity and
creativity. By allowing models to explore different
musical paths and receive feedback based on aesthetic
or stylistic goals, it would be possible to generate
more innovative and less predictable compositions.
For example, models like Transformer-VAE could
benefit from reinforcement learning to fine-tune the
generation process, balancing the trade-off between
structural coherence and creative exploration.
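As a purely speculative sketch of this direction (not drawn from any of the surveyed papers), a pretrained autoregressive generator could be fine-tuned with a REINFORCE-style update that weights the log-likelihood of a sampled continuation by a hypothetical aesthetic reward:

```python
# Speculative reward-weighted fine-tuning step (plain REINFORCE, no baseline).
# `model` is assumed to map a token-id tensor to next-token logits; `reward_fn`
# is a hypothetical aesthetic/stylistic scoring function.
import torch

def rl_finetune_step(model, optimizer, reward_fn, prompt_ids, max_len=128):
    tokens = list(prompt_ids)
    log_probs = []
    for _ in range(max_len):
        logits = model(torch.tensor(tokens).unsqueeze(0))[0, -1]   # next-token logits
        dist = torch.distributions.Categorical(logits=logits)
        next_token = dist.sample()
        log_probs.append(dist.log_prob(next_token))
        tokens.append(int(next_token))
    reward = reward_fn(tokens)                    # scalar score for the whole sequence
    loss = -reward * torch.stack(log_probs).sum() # push up likelihood of rewarded sequences
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```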
Another promising area for improvement is the
development of multimodal models that integrate
audio, text, and visual inputs. By training models on
datasets that combine musical scores, text
descriptions, and even visual cues (e.g., music videos),
models could generate music that aligns with specific
artistic intentions or narratives. This would open up
new possibilities for music generation, particularly in
fields like film scoring and video game soundtracks,
where the music must dynamically respond to visual
content. Improved evaluation metrics are also
essential for future research. Current models are often
evaluated based on subjective listening tests or
objective metrics like coherence. However, there is a
need for more sophisticated metrics that can measure
emotional expressiveness, creativity, and originality
in AI-generated music. By developing better tools for
assessing these qualities, researchers can improve
models like MuseGAN and Pop Music Transformer,
allowing them to generate music that follows
structural rules and evokes a deeper emotional
response.
In conclusion, while transformer-based models
for music generation have made great strides, they
have not yet reached their full potential. Addressing the limitations related to
computational efficiency, inter-track dependencies,
genre diversity, and creative freedom will be crucial
to advancing the field. Future developments are likely
to focus on integrating reinforcement learning,
expanding to multimodal inputs, and refining
evaluation metrics to create music that is not only
technically proficient but also emotionally and
creatively compelling.
6 CONCLUSION
To sum up, this paper has explored the functionality,
computational efficiency, and creative potential of
four key transformer-based models for music
generation: Transformer-VAE, Multitrack Music
Transformer, MuseGAN, and Pop Music
Transformer. Each model has its strengths and
limitations, from the flexibility and structural control
of Transformer-VAE to the efficiency and multitrack
harmony capabilities of the Multitrack Music
Transformer. MuseGAN excels in human-AI
collaboration, while the Pop Music Transformer
generates rhythmically focused compositions suitable
for pop music. These models showcase how AI can
contribute to the creative process by generating
coherent and structured music across various genres
and use cases. Future work in this field has great
potential to advance through improvements in
reinforcement learning, multimodal integration, and
evaluation metrics that better capture creativity and
emotional expressiveness. These advancements will
further enhance the ability of AI models to produce
innovative and emotionally compelling compositions.
In conclusion, these results contribute to the growing
body of research on algorithmic composition,
highlighting the strengths and challenges of current
models while identifying areas for future
development. By pushing the boundaries of AI-
generated music, these models represent significant
strides toward bridging the gap between human
creativity and machine learning.
REFERENCES
Agostinelli, A., Denk, T. I., Borsos, Z., et al., 2023.
MusicLM: Generating music from text. arXiv preprint
arXiv:2301.11325.
Ames, C., 1987. Automated Composition in Retrospect:
1956–1986. Leonardo, 20(2), 169-185.
Ames, C., 1989. The Markov Process as a compositional
Model: a survey and tutorial. Leonardo, 22(2), 175.
Dai, S., Yu, H., Dannenberg, R. B., 2022. What is missing
in deep music generation? A study of repetition and
structure in popular music. 23rd International Society
for Music Information Retrieval Conference, 11
Dong, H. W., Hsiao, W. Y., Yang, L. C., Yang, Y. H., 2018.
MuseGAN: multitrack sequential generative
adversarial networks for symbolic music generation
and accompaniment. Proceedings of the Thirty-Second
AAAI Conference on Artificial Intelligence and
Thirtieth Innovative Applications of Artificial
Intelligence Conference and Eighth AAAI Symposium,
32, 1.
Dong, H. W., Chen, K., Dubnov, S., McAuley, J., Berg-
Kirkpatrick, T., 2023. Multitrack music transformer.
ICASSP 2023-2023 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) pp.
1-5.
Huang, C. Z., Vaswani, A., Uszkoreit, J., Shazeer, N.,
Hawthorne, C., Dai, A., Hoffman, M., Eck, D., 2018.
Music Transformer: Generating Music with Long-Term
Structure. arXiv preprint arXiv:1809.04281.
Huang, Y. S., Yang, Y. H., 2020. Pop Music Transformer:
Beat-based Modeling and Generation of Expressive
Pop Piano Compositions. Proceedings of the 28th
ACM International Conference on Multimedia, 1180–
1188.
Jiang, J., Xia, G. G., Carlton, D. B., Anderson, C. N.,
Miyakawa, R. H., 2020. Transformer VAE: A
hierarchical model for structure-aware and
interpretable music representation learning. ICASSP
2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing
(ICASSP), 516-520.
Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A.,
Sutskever, I., 2020. Jukebox: A Generative Model for
Music. arXiv preprint arXiv:2005.00341.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A., Kaiser, Ł., Polosukhin, I., 2017.
Attention is All you Need. Advances in Neural
Information Processing Systems. Curran Associates,
Inc.