Artificial Neural Networks Jamming on the Beat

Alexey Tikhonov¹ and Ivan P. Yamshchikov² (https://orcid.org/0000-0003-3784-0671)
¹ Yandex, Berlin, Germany
² Higher School of Economics, St. Petersburg, Russia
Keywords:
Music Generation, Beat Generation, Generation of Polyphonic Music, Artificial Neural Networks.
Abstract:
This paper addresses the issue of long-scale correlations that are characteristic of symbolic music and pose a challenge for modern generative algorithms. It suggests a very simple workaround for this challenge, namely, the generation of a drum pattern that can further be used as a foundation for melody generation. The paper presents a large dataset of drum patterns alongside corresponding melodies. It explores two possible methods for drum pattern generation. By exploring a latent space of drum patterns, one can generate new drum patterns in a given musical style. Finally, the paper demonstrates that a simple artificial neural network can be trained to generate melodies corresponding to these drum patterns used as inputs. The resulting system can be used for end-to-end generation of symbolic music with song-like structure and stronger long-scale correlations between the notes.
1 INTRODUCTION
In recent years, there have been many projects dedicated to neural network-generated music. For an extended survey of such methods see (Briot et al., 2019). However, there were several attempts to automate the process of music composition long before the era of artificial neural networks. The well-developed theory of music inspired many heuristic approaches to automated music composition. The earliest idea that we know of dates as far back as the nineteenth century, see (Lovelace, 1843). In the middle of the twentieth century, a Markov chain approach for music composition was developed in (Hiller and Isaacson, 1959); this approach became relatively popular and was revisited and improved in many following works, see, for example, (Hill, 2011). (Lin and Tegmark, 2017) have demonstrated that music, as well as some other types of human-generated discrete time series, tends to have long-distance dependencies that cannot be captured by models based on Markov chains. Recurrent neural networks (RNNs) seem to be better at processing data series with longer internal dependencies (Sundermeyer et al., 2015), such as sequences of notes in a tune, see (Boulanger-Lewandowski et al., 2012).
Indeed, a variety of different recurrent neural networks, such as hierarchical RNNs, gated RNNs, Long Short-Term Memory (LSTM) networks, Recurrent Highway Networks, etc., were successfully used for music generation in (Chu et al., 2016), (Colombo et al., 2016), (Johnson, 2017a), (Wu et al., 2019), (Lattner and Grachten, 2019). Google Magenta released a series of projects dedicated to music generation. In particular, one should mention the music vae model (Roberts et al., 2018) that could be regarded as an extension of drum rnn [1]. It is important to distinguish generative models like music vae from generative models for music that use a straightforward language model approach and predict the next sound using the previous one as an input. For example, (Choi et al., 2016) used a language model approach to predict the next step in a beat with an LSTM. A variational autoencoder (VAE), see (Bowman et al., 2016) and (Semeniuta et al., 2017), on the other hand, allows us to construct a latent space in which each point corresponds to a melody. Such spaces, obtained with a VAE or any other suitable architecture, are of particular interest for different tasks connected with computational creativity, since they can be used both to study and classify musical structures and to generate new tunes with specified characteristics.

[1] https://github.com/tensorflow/magenta
The generation of polyphonic music is more challenging than the generation of a single melody line. (Lyu et al., 2015) use a set of parallel, tied-weight recurrent networks designed to be invariant to transpositions. (Johnson, 2017b) generates music in two steps.
First, a chord LSTM predicts a chord progression
based on a chord embedding. A second LSTM then
generates polyphonic music from the predicted chord
progression. (Chuan and Herremans, 2018) present
an approach for predictive music modeling and mu-
sic generation that incorporates domain knowledge in
its representation. The majority of the polyphonic methods that we know of still face certain difficulties with the temporal structure of the generated tunes. These problems are fundamentally connected with the properties of recurrent networks when applied in an end-to-end setting.
In this paper, we propose a straightforward approach that addresses these temporal challenges and could be used for end-to-end generation of tracks with song structure. We construct an explorable latent space of drum patterns with some recognizable genre areas. We test two different smoothing methods on the latent space of representations. The obtained latent space is then used to sample new drum patterns. We experiment with two techniques, namely, a variational autoencoder and adversarially constrained autoencoder interpolations (ACAI) (Berthelot et al., 2018). We also train a system that uses drum files as inputs and generates melodies corresponding to the input drum patterns. Such loops of melodies and drums could be combined into longer song-like tracks in a given tempo and tonality.
The contribution of this paper is three-fold: (1)
we publish a large dataset of drum patterns, (2) de-
velop an overall representation of typical beat pat-
terns mapped into a two-dimensional space, and (3)
demonstrate that a simple artificial neural network
could be trained to generate tunes corresponding to
the drum patterns.
2 DATASET
Most of the projects that we know of used small datasets of manually selected and cleaned beat patterns. One should mention the GrooveMonkee free loop pack [2], the free drum loops collection [3], aq-Hip-Hop-Beats-60–110-bpm [4], and (Gillick et al., 2019).
Unfortunately, the majority of these datasets are either restricted to one or two specific genres or contain a very limited amount of MIDI samples that does not exceed a dozen per genre. This amount of data is not enough to infer a genre-related latent space. Inferring this space, however, could be of utmost importance. Due to the interpolative properties of a model that operates on such a space, one can produce infinitely diverse patterns that still adhere to the genre-specific macro-structure. Groove MIDI (Gillick et al., 2019) to a certain extent goes in line with the material presented in this paper, yet it is not big enough for the inference of genre structure.

[2] https://groovemonkee.com/collections/midi-loops
[3] https://www.flstudiomusic.com/2015/02/35-free-drum-loops-wav-midi-for-hip-hop-pop-and-rock.html
[4] https://codepen.io/teropa/details/JLjXGK
Here we introduce a completely new dataset of MIDI drum patterns [5] that we automatically extracted from a vast MIDI collection available online. This dataset is based on approximately two hundred thousand MIDI files and, as we show later, is big enough to infer the macroscopic structure of the underlying latent space with unsupervised methods.

[5] https://github.com/altsoph/drum_space
2.1 Data Filtering
The pre-processing of the data was done as follows. Since the ninth channel is associated with percussion according to the MIDI standard, we assumed that we are only interested in the tracks that have non-trivial information in that channel. All the tracks with trivial ninth channels were filtered out. This filtering left us with almost ninety thousand tracks. Additional filtering included enforcement of a 4/4 time signature and quantization of the tracks. We are aware that such pre-processing is coarse, since it ultimately corrupts several relatively popular rhythmic structures, for example, waltzes, yet the vast majority of the rhythmic patterns are still non-trivial after such pre-processing. We believe that the 4/4 time signature is not a prerequisite for the reproduction of the results demonstrated here and encourage researchers to experiment and publish broad and diverse datasets of percussion patterns. In order to reduce the dimensionality of the problem, we simplified the subset of instruments, merging the signals from similar instruments. For example, all snares are merged into one snare sound, low and low-mid toms into a low tom, and high and high-mid toms into a high tom. Finally, we split the percussion tracks into percussion patterns. Every track was split into separate chunks based on long pauses. If a percussion pattern that was thirty-two steps long occurred at least three times in a row, it was added to the list of viable patterns. Trivial patterns with entropy below a certain minimal threshold were discarded from the list of viable patterns. Finally, every pattern was checked to be unique in all its possible phase shifts. The resulting dataset includes thirty-three thousand unique patterns and is published alongside this paper; it is an order of magnitude larger than the available MIDI data sources.
Table 1: Pseudo-code that describes filtering heuristics used
to form the dataset of percussion patterns.
// Filtering the original MIDI dataset
for new_track in MIDI_dataset do
    if new_track[channel_9] is non-trivial
        // Quantize with 4/4 signature
        drum_track <- new_track[channel_9].quantize()
        // Merge different drums according to a predefined table
        drum_track.merge_drums()
        // Split the drum track into chunks
        for new_chunk in drum_track.split_by_pauses() do
            if length(new_chunk) == 32 \
               and new_chunk occurs at least 3 times in a row in drum_track \
               and entropy(new_chunk) > k
                percussion_patterns.append(new_chunk)

// Filtering non-unique percussion patterns
for new_pattern in percussion_patterns do
    // Create all possible phase shifts of a pattern
    shifted_patterns <- new_pattern.all_shifts()
    // Search for duplicate patterns and delete them
    for pattern in percussion_patterns \ [new_pattern] do
        if pattern in shifted_patterns
            delete pattern from percussion_patterns
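To make Table 1 concrete, here is a minimal Python sketch of the chunk-level checks, assuming the drum track has already been quantized into a binary numpy array of shape (num_ticks, 14); the helper names and the entropy threshold k are our own illustration, not the exact implementation used to build the dataset.

```python
import numpy as np

def pattern_entropy(chunk: np.ndarray) -> float:
    """Shannon entropy of the per-tick instrument combinations of a (ticks, 14) chunk."""
    codes = (chunk * (1 << np.arange(chunk.shape[1]))).sum(axis=1)
    _, counts = np.unique(codes, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def repeats_at_least(track: np.ndarray, chunk: np.ndarray, times: int = 3) -> bool:
    """Check that the chunk occurs at least `times` times in a row inside the track."""
    length = chunk.shape[0]
    for start in range(track.shape[0] - length * times + 1):
        window = track[start:start + length * times]
        if all(np.array_equal(window[i * length:(i + 1) * length], chunk)
               for i in range(times)):
            return True
    return False

def is_viable(track: np.ndarray, chunk: np.ndarray, k: float = 1.0) -> bool:
    """Chunk-level filter from Table 1: fixed length, repetition, non-trivial entropy."""
    return (chunk.shape[0] == 32
            and repeats_at_least(track, chunk, 3)
            and pattern_entropy(chunk) > k)
```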
Figure 1: Some examples of the two-dimensional representation of drum patterns.
2.2 Data Representation
The resulting dataset consists of similarly structured percussion patterns. Each pattern has thirty-two time ticks for fourteen possible percussion instruments left after the simplification. Each pattern could be represented as a 14 × 32 matrix with ones at the positions where the corresponding instrument makes a hit. Figure 1 shows possible two-dimensional representations of the resulting patterns.
We can also list all possible combinations of fourteen instruments that can play at the same time tick. In this representation, each pattern is described by thirty-two integers in the range from 0 to 16383. Such a representation is straightforward and could be convenient for processing the data with modern models used for generation of discrete sequences (think of a generative model with a vocabulary consisting of $2^{14}$ words). The final dataset is published in the following format:
- the first column holds the pattern code that consists of thirty-two comma-separated integers in the range of [0, 16383];
- the second column holds four comma-separated float values that represent the point of this pattern in the four-dimensional latent space that we describe below;
- the third column holds two comma-separated float values of the t-SNE mapping from the four-dimensional latent space into a two-dimensional one, see details below.
The model that we describe further works with a
two-dimensional representation shown in Figure 1.
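To illustrate how the two encodings relate, here is a short sketch (our own, assuming hits are stored as a 14 × 32 binary numpy matrix) converting a pattern to the thirty-two integer codes in [0, 16383] and back:

```python
import numpy as np

N_INSTRUMENTS, N_TICKS = 14, 32

def matrix_to_codes(pattern: np.ndarray) -> list:
    """Encode a 14x32 binary hit matrix as 32 integers in [0, 16383]."""
    weights = 1 << np.arange(N_INSTRUMENTS)          # 2**0 .. 2**13
    return [int((pattern[:, t] * weights).sum()) for t in range(N_TICKS)]

def codes_to_matrix(codes: list) -> np.ndarray:
    """Decode the 32 integer codes back into the binary hit matrix."""
    pattern = np.zeros((N_INSTRUMENTS, N_TICKS), dtype=np.int8)
    for t, code in enumerate(codes):
        for i in range(N_INSTRUMENTS):
            pattern[i, t] = (code >> i) & 1
    return pattern

# Round trip on a random sparse pattern
p = (np.random.rand(N_INSTRUMENTS, N_TICKS) > 0.8).astype(np.int8)
assert np.array_equal(codes_to_matrix(matrix_to_codes(p)), p)
```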
3 MODELS AND EXPERIMENTS
In this paper, we experiment with different autoencoders. Let us first briefly clarify the underlying principles of these architectures.
3.1 Autoencoders
Autoencoders are a broad class of structures that process the input $x \in \mathbb{R}^{d_x}$ through an 'encoder' $z = f_\theta(x)$ parametrized by $\theta$ to obtain a latent code $z \in \mathbb{R}^{d_z}$. The latent code is then passed through a 'decoder' $\hat{x} = g_\phi(z)$ parametrized by $\phi$ to produce an approximate reconstruction $\hat{x} \in \mathbb{R}^{d_x}$ of the input $x$. In this paper $f_\theta$ and $g_\phi$ are multi-layer neural networks. The encoder and decoder are trained simultaneously (i.e. with respect to $\theta$ and $\phi$) to minimize some notion of distance between the input $x$ and the output $\hat{x}$, for example the squared $L_2$ distance $\|x - \hat{x}\|^2$.
Interpolating using an autoencoder describes the process of using the decoder $g_\phi$ to decode a mixture of two latent codes. Typically, the latent codes are combined via a convex combination, so that interpolation amounts to computing $\hat{x}_\alpha = g_\phi(\alpha z_1 + (1 - \alpha) z_2)$ for some $\alpha \in [0, 1]$, where $z_1 = f_\theta(x_1)$ and $z_2 = f_\theta(x_2)$ are the latent codes corresponding to data points $x_1$ and $x_2$. Ideally, adjusting $\alpha$ from 0 to 1 will produce a sequence of realistic datapoints where each subsequent $\hat{x}_\alpha$ is progressively less semantically similar to $x_1$ and more semantically similar to $x_2$. The notion of 'semantic similarity' is problem-dependent and ill-defined.
A VAE assumes that the data is generated by a directed graphical model $p_\theta(x|h)$ and that the encoder is learning an approximation $q_\phi(h|x)$ to the posterior distribution $p_\theta(h|x)$. This yields an additional loss component and a specific training algorithm called Stochastic Gradient Variational Bayes (SGVB), see (Rezende et al., 2014) and (Kingma and Welling, 2014). The probability distribution of the latent vector of a VAE typically matches that of the training data much more closely than that of a standard autoencoder.
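For completeness, a minimal PyTorch-style sketch of the extra VAE loss term with a standard Gaussian prior and the reparametrization trick; this is the textbook SGVB objective, not the authors' exact code:

```python
import torch

def reparametrize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction term plus KL divergence to the standard normal prior."""
    recon = torch.nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```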
ACAI has a different underlying mechanism. It uses a critic network, as is done in Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). The critic is fed interpolations of existing datapoints (i.e. $\hat{x}_\alpha$ as defined above). Its goal is to predict $\alpha$ from $\hat{x}_\alpha$. This can be regarded as a regularization procedure which encourages interpolated outputs to appear more realistic by fooling a critic network that has been trained to recover the mixing coefficient from interpolated data.
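A sketch of the core ACAI terms as we read them from (Berthelot et al., 2018): the critic tries to regress the mixing coefficient from decoded interpolations, while the autoencoder is penalized when the coefficient is recoverable. The pairing scheme, names and weighting below are illustrative simplifications, not the reference implementation.

```python
import torch

def acai_terms(encoder, decoder, critic, x, lam=0.5):
    """Return the critic regression loss and the autoencoder regularization term."""
    z = encoder(x)
    alpha = 0.5 * torch.rand(x.size(0), 1, device=x.device)   # mixing coefficient in [0, 0.5]
    z_mix = alpha * z + (1 - alpha) * z.flip(0)                # mix each item with another batch item
    x_mix = decoder(z_mix)
    alpha_pred = critic(x_mix)
    critic_loss = ((alpha_pred - alpha) ** 2).mean()           # critic learns to recover alpha
    ae_reg = lam * (alpha_pred ** 2).mean()                    # autoencoder pushes predictions to 0
    return critic_loss, ae_reg
```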
3.2 Architecture
In this paper, we experiment with a network that consists of a three-layer fully connected encoder and a decoder of the same size. The encoder maps the beat matrix (32 × 14 bits) into a four-dimensional latent space. The first hidden layer has sixty-four neurons; the second one has thirty-two. ReLU activations are used between the layers, and a sigmoid maps the decoder output back into the bit mask. Figure 2 shows the general architecture of the network.
The crucial part of the model that matters for further experiments is the space of latent codes, the so-called 'bottleneck' of the architecture shown in Figure 2. This is a four-dimensional space of latent representations $z \in \mathbb{R}^4$. The structural difference between the VAE and ACAI models with which we experiment further occurs exactly in this bottleneck. The architectures of the encoder $f_\theta$ and decoder $g_\phi$ are equivalent. Effectively, VAE and ACAI could be regarded as two smoothing procedures over the space of latent codes.
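A PyTorch sketch of the shared encoder/decoder shape described above (448 input bits, hidden layers of 64 and 32 units, a 4-dimensional bottleneck, ReLU between layers and a sigmoid on the output); the decoder is assumed to mirror the encoder, which is our reading of Figure 2 rather than the authors' exact code.

```python
import torch.nn as nn

LATENT_DIM = 4  # dimensionality of the latent code z

encoder = nn.Sequential(
    nn.Linear(32 * 14, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, LATENT_DIM),
)

decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 32), nn.ReLU(),
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 32 * 14), nn.Sigmoid(),   # probabilities for the 32x14 bit mask
)
```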
3.3 Visualization of the Obtained Latent Space
To explore the obtained dataset, we have built an interactive visualization that is available online [6] and is similar to the one described in (Yamshchikov and Tikhonov, 2018). This visualization allows us to navigate the resulting latent space of percussion patterns. Training patterns are marked with grey and generated patterns are marked with red. For the interactive visualization, we use a t-SNE projection of the VAE space since it has a more distinctive geometric structure, shown in Figure 3.

[6] http://altsoph.com/pp/dsp/map.html
Moreover, this visualization, in some sense, validates the data representation proposed above. Indeed, roughly a third of the tracks in the initially collected MIDIs had genre labels in their filenames. After training the VAE, we used these labels to locate and mark the areas with patterns of specific genres. Looking closely at Figure 3, which shows a t-SNE projection of the obtained latent space, one can notice that the geometric clusters in the obtained latent space correspond to the genres of the percussion patterns. The positions of the genres in the figure were determined by the mean coordinates of the tracks attributed to the corresponding genre. One can see that related genres are closer to each other in the obtained latent space and the overall structure of the space is meaningful. For example, the cloud of 'Punk' samples is located between the 'Rock' and 'Metal' clouds, whereas 'Hip-Hop' borders 'Soul', 'Afro' and 'Pop'. The fact that the VAE managed to capture this correspondence in an unsupervised set-up (as a by-product of training with a standard reconstruction loss) demonstrates that the chosen data representation is applicable to the proposed task, and the proposed architecture manages to infer a valid latent space of patterns.
As we have mentioned above, we compare two
different latent space smoothing techniques, namely,
VAE and ACAI. It is important to note here that the
standard VAE produces results that are good enough:
the space mapping is clear and meaningful, as we
have mentioned above. At the same time, the ACAI
space seems to be smoother, yet harder to visualize in
two dimensions.
Figure 4 illustrates this idea, showing the two-
dimensional t-SNE mapping of the latent spaces pro-
duced by both methods with patterns that correspond
to the genre METAL marked with red dots. One can
see that the ACAI mapping of a particular genre is not as dense as that of the VAE. For this reason, we use the t-SNE projection of the VAE space for the interactive visualization mentioned above and throughout this paper.
However, we argue that the latent space produced
with ACAI is better to sample from and discuss it in
detail further.
Figure 2: Basic scheme of an autoencoder used to produce a latent space of patterns.
Figure 3: t-SNE projection of the latent percussion space produced by VAE. Different areas correspond to specific genres. One can see a clear macro-structure with hip-hop, soul and afro beats grouped closer together and with rock, punk and metal in another area of the obtained space.
3.4 Generating the Beat
The majority of the autoencoder-based methods generate new samples according to the standard logic. One can sample an arbitrary point from the latent space and use the decoder to convert that point into a new pattern. In the case of the VAE, one can also narrow the area of sampling and restrict the algorithm in the hope of obtaining beats that would be representative of the style typical for that area. However, an objective metric that could be used for quality estimation of the generated samples is still a matter of discussion. Such objective estimation is even harder in this particular case, since the patterns are quantized and consist of thirty-two steps and fourteen instruments. Indeed, virtually any sequence could be a valid percussion pattern, and human evaluation of such tasks is usually costly and, naturally, subjective. We invite the reader to estimate the quality of the generated samples on their own using the demo mentioned above.
At the same time, we propose a simple heuristic method that allows putting the quality of different architectures into relative perspective.

Table 2: Comparison of the two smoothing methods. ACAI seems to be far more useful for sampling, since it produces a valid percussion pattern out of a random point in the latent space more than 50% of the time and is three times more effective than the VAE-based architecture. In terms of the heuristic entropy filter, VAE performs even worse than AE, generating a lot of "dull" samples with entropy below the threshold.

Model                % of patterns after filtering
AE                   28%
VAE                  17%
ACAI                 56%
Empirical patterns   82%
Table 1 contains the pseudo-code that was used for the filtering of the original MIDI dataset. We suggest using the percussion-related part of this filtering heuristic to estimate the quality of generated percussion patterns. Indeed, one can generate a set of random points in the latent space, sample corresponding percussion patterns with the decoder, and then apply the filtering heuristics. The resulting percentage of the generated beats that pass the filter could be used as an estimate of the quality of the model.
The percentage of the real MIDI files from the
training dataset that pass the final entropy filter could
be used as a baseline for both architectures.
To have a lower baseline, we also trained a classic autoencoder without any smoothing of the latent space whatsoever. The examples of the tracks generated by it are also available online [7].

[7] https://github.com/altsoph/drum_space
This simple heuristic filtering shows that VAE-generated beats have a quality of about 17%. In other words, on average, one out of six generated beats passes the simple filter successfully. In the case of ACAI, the quality happens to be significantly higher: 56% of the produced beats satisfy the filtering conditions. More than half of the generated patterns passed the filters.

Figure 4: The beats from the area that corresponds to the genre metal on the VAE space projection (left) and the ACAI space projection (right). VAE maps the tracks of the same genre closer together and therefore is beneficial for the visualization of the latent space.
In order to have a baseline to compare both meth-
ods, one can look at the percentage of empirical MIDI
files that pass through the last entropy-based filter.
One can see that in this context the patterns ran-
domly sampled with ACAI are comparable with the
empirical ones that were present in the original MIDI
dataset.
3.5 Jamming on the Beat
While filtering the original data, we could not help noticing that there are melodic patterns attached to every drum loop. These melody patterns were played by various instruments designated in a MIDI file. We collected the dataset of these melody loops and present them along with the drum patterns. Every melody is encoded with three different embeddings that we concatenate in the input vector. These embeddings include the instrument playing the melody, the tone and the octave. To simplify the generation we do not include the length of the note played; this could be easily amended with an additional embedding for the length of a note. For a detailed description of the melody preprocessing we address the reader to (Yamshchikov and Tikhonov, 2020).
These melody embeddings are concatenated with the flattened representation of the beat and form an input vector, see Figure 5.
Let us briefly describe Figure 5. The input is represented as a 496-dimensional vector. It consists of a flattened drum pattern encoded as a 32 × 14 matrix, see Section 2.2, concatenated with three 16-dimensional embeddings of the instrument, tonality and octave. This input vector is first projected with a 496 × 64 ReLU layer, then processed through three more 64 × 64 ReLU layers, and a final 64 × 8192 sigmoid projection. The resulting 8192-dimensional vector encodes 64 time steps with 128 possible pitches on every step.

Figure 5: A simple melody generation network that generates melodies corresponding to the rhythm.
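A PyTorch sketch matching the shapes just described (448-bit flattened beat plus three 16-dimensional embeddings, a 496 × 64 projection, three 64 × 64 ReLU layers, and an 8192-dimensional sigmoid output); the embedding vocabulary sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MelodyNet(nn.Module):
    def __init__(self, n_instruments=128, n_tones=12, n_octaves=8):
        super().__init__()
        # three 16-dimensional embeddings: instrument, tonality and octave
        self.instrument = nn.Embedding(n_instruments, 16)
        self.tone = nn.Embedding(n_tones, 16)
        self.octave = nn.Embedding(n_octaves, 16)
        self.body = nn.Sequential(
            nn.Linear(32 * 14 + 3 * 16, 64), nn.ReLU(),   # 496 -> 64 projection
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64 * 128), nn.Sigmoid(),        # 64 time steps x 128 pitches
        )

    def forward(self, beat, instrument, tone, octave):
        # beat: (batch, 448) flattened 32x14 drum pattern; the rest are integer indices
        x = torch.cat([beat,
                       self.instrument(instrument),
                       self.tone(tone),
                       self.octave(octave)], dim=-1)
        return self.body(x).view(-1, 64, 128)
```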
As in other computational creativity tasks, see, for example, (Agafonova et al., 2020), one can significantly improve the results of the generation with automated heuristic filtering. In the case of symbolic music, additional heuristic filtering of the generated melodies includes some basic rules that help to harmonize the obtained tunes, e.g., if the range of the tune exceeds two octaves we filter it out. Such cases are very rare for a given instrument within a relatively short drum loop. Another example of such a heuristic could be: if the ratio of note lengths in a tune exceeds some fixed constant, say the shortest time between two notes is ten times shorter than the longest one, we could also filter that melody, ensuring a relatively dense tune on the beat without pauses lasting half of the loop. These heuristics could be tweaked and tinkered with. The resulting melody patterns that we showcase in the repository are filtered with the following list of heuristics (a code sketch of some of these checks follows the list). We filter the loop out if:
- it has sounds on three or fewer steps out of the 64 time ticks;
- there is a pause that is longer than 16 consecutive ticks;
- it includes two notes within a tone or a semitone from one another that are played at once;
- there are more than two octaves between the lowest and the highest note in the loop;
- the loop includes fewer than three different notes;
- at some point of the loop there are four or more notes played at once;
- the most frequent note in the loop makes up more than three quarters of all notes played in the loop;
- the tonality detection heuristic detects a tonality that is different from the target tonality that was given in the input.
We developed a stand-alone tonality detector that is rather intuitive yet, to our knowledge, had not been applied to the task of music generation before. To improve the usability of the heuristics and the reproducibility of this work, which is especially important due to its explorative nature, we publish our code online [8]. The tonality heuristic lists all possible tonalities. For every given melody we score the notes and consecutive note pairs that correspond to possible tonalities and choose the most probable one. One can test the heuristic on the dataset of MIDI files with labelled tonalities that is included in the repository.

[8] https://github.com/altsoph/tonika_detector
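A rough sketch of such a key-scoring heuristic, a simplification of the published detector that scores only individual notes (the released code also scores consecutive note pairs): each candidate major key is scored by how many melody notes fall into its diatonic scale, and the best-scoring key wins.

```python
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}          # pitch classes of a major scale
KEYS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def detect_key(pitches):
    """Return the major key whose scale covers the most notes of the melody."""
    best_key, best_score = None, -1
    for root in range(12):
        scale = {(root + degree) % 12 for degree in MAJOR_SCALE}
        score = sum(1 for p in pitches if p % 12 in scale)
        if score > best_score:
            best_key, best_score = KEYS[root], score
    return best_key

print(detect_key([60, 62, 64, 65, 67, 69, 71, 72]))   # C major scale -> "C"
```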
The main idea behind melody generation that we wanted to illustrate here is that a drum pattern can serve as an organizing structural layer used to achieve a certain synchronisation on a longer time scale, therefore simplifying the generation of longer song-like structures based on an interplay of certain repetitive patterns.
4 DISCUSSION
Deep learning enables the rapid development of various generative algorithms. There are various limitations that hinder the arrival of algorithms that could generate discrete sequences that would be indistinguishable from the corresponding sequences generated by humans. In some contexts, the potential of such algorithms might still be limited by the availability of training data; in others, such as natural language, the internal structure of this data might be a challenge; finally, some of such tasks might be simply too intensive computationally and therefore too costly to use. However, percussion patterns do not have such limitations. The structure of the data can be formalized reasonably well and without significant loss of nuance. In this paper, we provide thirty-three thousand thirty-two-step 4/4 signature percussion patterns and demonstrate that such a dataset allows training a good generative model. We hope that as more and more data becomes available for experiments, percussion could be the first chapter to be closed in the book of generative music.
Once the drum pattern progression is defined, one could use very intuitive generative methods in combination with a tonality filter to generate the melody over the generated beat. The resulting loops could be rearranged in any progression, providing longer macro-structures. The level of novelty could be further controlled through purely combinatoric parameters of the song structure. These endeavours, however, are mostly either heuristically motivated or anecdotal rather than data-driven. Generative models capable of smooth interpolations between different rhythmic patterns represent another set of new research questions. Finally, nuances of percussion, alongside the datasets and the models that could capture these nuances, for example see (Gillick et al., 2019), need further research.
Aside from the generation of symbolic music, there are plenty of open questions regarding generation of sound fonts, intonation and dynamics both on micro and macro levels of the melody progression.
5 CONCLUSION
This paper presents a new large dataset of MIDI percussion patterns that could be used for further research on generative percussion algorithms. It also presents corresponding melody patterns that are aligned with the given percussion tracks. This dataset could be further used for symbolic loop generation.
The paper also explores two autoencoder-based architectures that could be successfully trained to generate new MIDI beats. Both structures have similar fully connected three-layer encoders and decoders but use different methods for smoothing of the produced space of latent representations. Adversarially constrained autoencoder interpolations (ACAI) seem to provide denser representations than the ones produced by a variational autoencoder. More than half of the percussion patterns generated with ACAI pass the simple heuristic filter used as a proxy for the resulting generation quality estimation. To our knowledge, this is the first application of ACAI to drum-pattern generation.
The interactive visualization of the latent space is
available as a tool to subjectively assess the quality of
the generated percussion patterns.
Finally, the paper explores the possibility of generating melodies that correspond to a given input pattern and demonstrates that this could be done with a relatively straightforward artificial neural network.
ACKNOWLEDGEMENTS
The authors would like to thank Valentina Barsuk for
her constructive advice and profound expertise.
REFERENCES
Agafonova, Y., Tikhonov, A., and Yamshchikov, I. P.
(2020). Paranoid transformer: Reading narrative
of madness as computational approach to creativity.
arXiv preprint arXiv:2007.06290.
Berthelot, D., Raffel, C., Roy, A., and Goodfellow, I.
(2018). Understanding and improving interpolation
in autoencoders via an adversarial regularizer. arXiv
preprint arXiv:1807.07543.
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P.
(2012). Modeling temporal dependencies in high-
dimensional sequences: Application to polyphonic
music generation and transcription. In Proceedings
of the 29th International Coference on International
Conference on Machine Learning 2012, pages 1881–
1888.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Joze-
fowicz, R., and Bengio, S. (2016). Generating sen-
tences from a continuous space. In 20th SIGNLL Con-
ference on Computational Natural Language Learn-
ing, pages 10–21.
Briot, J.-P., Hadjeres, G., and Pachet, F.-D. (2019). Deep
learning techniques for music generation-a survey.
Choi, K., Fazekas, G., and Sandler, M. (2016). Text-based
lstm networks for automatic music composition. In
arXiv preprint.
Chu, H., Urtasun, R., and Fidler, S. (2016). Song from pi: A
musically plausible network for pop music generation.
In arXiv preprint.
Chuan, C.-H. and Herremans, D. (2018). Modeling tempo-
ral tonal relations in polyphonic music through deep
networks with a novel image-based representation. In
AAAI, pages 2159–2166.
Colombo, F., Muscinelli, S. P., Seeholzer, A., Brea, J., and Gerstner, W. (2016). Algorithmic composition of melodies with deep recurrent neural networks. In arXiv preprint.
Gillick, J., Roberts, A., Engel, J., Eck, D., and Bamman,
D. (2019). Learning to groove with inverse sequence
transformations. arXiv preprint arXiv:1905.06118.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In
Advances in neural information processing systems,
pages 2672–2680.
Hill, S. (2011). Markov melody generator. Computer Sci-
ence Department, University of Massachusetts Low-
ell, Published on Dec, 11.
Hiller, L. and Isaacson, L. (1959). Experimental Music. Composition with an Electronic Computer. McGraw-Hill Company.
Johnson, D. D. (2017a). Generating polyphonic music us-
ing tied parallel networks. International Conference
on Evolutionary and Biologically Inspired Music and
Art, pages 128–143.
Johnson, D. D. (2017b). Generating polyphonic music us-
ing tied parallel networks. In International conference
on evolutionary and biologically inspired music and
art, pages 128–143. Springer.
Kingma, D. P. and Welling, M. (2014). Auto-encoding vari-
ational bayes. In arXiv preprint.
Lattner, S. and Grachten, M. (2019). High-level control of
drum track generation using learned patterns of rhyth-
mic interaction. In 2019 IEEE Workshop on Appli-
cations of Signal Processing to Audio and Acoustics
(WASPAA), pages 35–39. IEEE.
Lin, H. W. and Tegmark, M. (2017). Critical behavior in
physics and probabilistic formal languages. Entropy,
19(7):299.
Lovelace, A. (1843). Notes on L. Menabrea's sketch of the Analytical Engine by Charles Babbage, Esq. In Taylor's Scientific Memoirs.
Lyu, Q., Wu, Z., Zhu, J., and Meng, H. (2015). Modelling
high-dimensional sequences with lstm-rtrbm: Appli-
cation to polyphonic music generation. In Twenty-
Fourth International Joint Conference on Artificial In-
telligence.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014).
Stochastic backpropagation and approximate infer-
ence in deep generative models. ICML, pages 1278–
1286.
Roberts, A., Engel, J., Raffel, C., Hawthorne, C., and Eck,
D. (2018). A hierarchical latent vector model for
learning long-term structure in music. arXiv preprint
arXiv:1803.05428.
Semeniuta, S., Severyn, A., and Barth, E. (2017). A hybrid
convolutional variational autoencoder for text gener-
ation. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing,
pages 627–637.
Sundermeyer, M., Schlüter, R., and Ney, H. (2015). LSTM neural networks for language modeling. Interspeech, pages 194–197.
Wu, J., Hu, C., Wang, Y., Hu, X., and Zhu, J. (2019).
A hierarchical recurrent neural network for symbolic
melody generation. IEEE Transactions on Cybernet-
ics, 50(6):2749–2757.
Yamshchikov, I. and Tikhonov, A. (2018). I feel you:
What makes algorithmic experience personal? In EVA
Copenhagen.
Yamshchikov, I. P. and Tikhonov, A. (2020). Music genera-
tion with variational recurrent autoencoder supported
by history. SN Applied Sciences, 2(12):1–7.