A Reinforcement Learning Environment for Directed Quantum Circuit
Synthesis
Michael Kölle, Tom Schubert, Philipp Altmann, Maximilian Zorn, Jonas Stein
and Claudia Linnhoff-Popien
Institute of Informatics, LMU Munich, Munich, Germany
Keywords:
Reinforcement Learning, Quantum Computing, Quantum Circuit Synthesis.
Abstract:
With recent advancements in quantum computing technology, optimizing quantum circuits and ensuring reli-
able quantum state preparation have become increasingly vital. Traditional methods often demand extensive
expertise and manual calculations, posing challenges as quantum circuits grow in qubit- and gate-count. There-
fore, harnessing machine learning techniques to handle the growing variety of gate-to-qubit combinations is
a promising approach. In this work, we introduce a comprehensive reinforcement learning environment for
quantum circuit synthesis, where circuits are constructed utilizing gates from the Clifford+T gate set to
prepare specific target states. Our experiments focus on exploring the relationship between the depth of synthe-
sized quantum circuits and the circuit depths used for target initialization, as well as qubit count. We organize
the environment configurations into multiple evaluation levels and include a range of well-known quantum
states for benchmarking purposes. We also lay baselines for evaluating the environment using Proximal Pol-
icy Optimization. By applying the trained agents to benchmark tests, we demonstrated their ability to reliably
design minimal quantum circuits for a selection of 2-qubit Bell states.
1 INTRODUCTION
The field of quantum computing, including quantum sensing, quantum metrology, quantum communication, and quantum cryptography, has recently been receiving a lot of attention (Tóth and Apellaniz, 2014; Ekert, 1991). Consequently, directed quantum circuit
synthesis (DQCS) involving quantum state prepara-
tion, which plays a vital role in the above-mentioned
technologies, gains more and more interest. Current
quantum circuit layouts range from rather straightfor-
ward ones, as used for Bell state preparation (Barenco
et al., 1995; Bennett and Wiesner, 1992; Nielsen
and Chuang, 2010), to highly sophisticated designs
including tunable parameters, as present in Varia-
tional Quantum Classifiers and Variational Quantum
Eigensolvers (Farhi and Neven, 2018; Schuld et al.,
2020; Peruzzo et al., 2013). Though many approaches
addressing quantum circuit development are known,
most of them focus on the optimization of already
existing circuits (Fösel et al., 2021; Li et al., 2023).
On the contrary, research regarding the circuit syn-
thesis is sparse, making manual methods still a state-
of-the-art technique for tackling these tasks. This cir-
cumstance gets especially problematic if the involved
quantum circuits increase in qubit number and gate
count, resulting in large state-spaces with an expo-
nential amount of possible layouts. Hence, to guar-
antee efficient task-solving, it is crucial to develop an
approach that tackles the problem of DQCS and de-
creases the amount of required human insight.
Considering the aforementioned examples,
it becomes apparent that machine learning (ML) ap-
proaches are especially suited for subjects involving
complex calculations and elaborate combinatorial op-
timizations. Hence, we consider an ML-based tech-
nique for the DQCS problem. Since quantum circuit
construction and optimization does not entail learn-
ing from data in the classical ML sense, we instead
apply the concept of improving by repeated interac-
tion with a problem environment via reinforcement
learning (RL). While research for applying RL on
state preparation tasks is known, there is limited ex-
ploration of the underlying DQCS problem and the
implementation on actual quantum hardware utilizing
a set of distinct quantum gates (cf. (Gabor et al., 2022;
Mackeprang et al., 2019)). Similarly, the disassembly
of a proposed circuit into a sequence of valid quantum
gates is a crucial step in facilitating the transfer to a
real quantum device (Mansky et al., 2022).
We introduce the Quantum Circuit RL environ-
ment designed to train RL agents on the task of
preparing randomly generated quantum states uti-
lizing the Clifford+T gate set, enabling the trained
agents to solve the DQCS problem for arbitrary target
states. In our environment, we view the use of quan-
tum gates on quantum states as actions and the quan-
tum state as observation. The objective is to prepare
a specific target state efficiently, and success is mea-
sured by minimizing the number of gates needed to
construct the quantum circuit. We also evaluate agents based on the Proximal Policy Optimization algorithm on different configurations of our environment to form a base-
line. Lastly, we benchmark the trained agents on a set
of well-known 2-qubit states.
This work is structured as follows. We first give a
short overview of the related work in Section 2. We
then introduce our Quantum Circuit Environment for
circuit synthesis in Section 3, followed by our experi-
mental setup in Section 4, training and hyperparameter optimization in Section 5, and results in Section 6. Fi-
nally, we conclude with a summary and future work
in Section 7.
2 RELATED WORK
In this section we introduce underlying concepts es-
sential for a clear comprehension of our research.
Further, we provide an overview of prior work in the
field of AI-assisted quantum circuit generation and
optimization, connecting it to our approach.
2.1 Quantum States
Quantum computing is an emerging technology dis-
tinguishing itself from classical computing funda-
mentally by the usage of so-called quantum bits or
qubits instead of classical bits as units of information
storage. Among other aspects, qubits differ from their classical counterparts by exhibiting superposition. This describes the ability of the qubit to not
only be in one of the discrete states 0 or 1, like a clas-
sical bit, but to be in any linear combination of 0 and
1, enlarging the available state-space from discrete to
continuous. Eq. 1 defines the quantum state of one
qubit in Dirac- and vector-notation.
\[ |q\rangle = \alpha |0\rangle + \beta |1\rangle = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \qquad \alpha, \beta \in \mathbb{C} \tag{1} \]
Another characteristic of qubits is the so-called entan-
glement, which describes the possibility of correlat-
ing the states of multiple qubits, enabling the setup of
complex relations between them. Another option for
the description of quantum states is the density matrix
representation given in Eq. 2.
\[ \rho = |q\rangle\langle q| = \begin{pmatrix} |\alpha|^2 & \alpha\beta^* \\ \alpha^*\beta & |\beta|^2 \end{pmatrix}, \qquad \alpha, \beta \in \mathbb{C} \tag{2} \]
To compare two density matrices, and hence two quantum states ρ and ρ̃, the fidelity F as given in Eq. 3 is used as a measure.
\[ F(\rho, \tilde{\rho}) = \left( \mathrm{tr} \sqrt{\sqrt{\rho}\, \tilde{\rho}\, \sqrt{\rho}} \right)^2 \tag{3} \]
When solely states in vector representation are con-
sidered, this equation simplifies to the expression
given in Eq. 4 (Jozsa, 1994).
\[ F(\rho, \tilde{\rho}) = |\langle q | \tilde{q} \rangle|^2 \tag{4} \]
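As an illustration, both fidelity measures can be evaluated directly with NumPy and SciPy. The sketch below is a minimal example and not part of the environment implementation.

import numpy as np
from scipy.linalg import sqrtm

def fidelity_pure(q, q_tilde):
    # Eq. 4: fidelity of two normalized pure states given as state vectors
    return abs(np.vdot(q, q_tilde)) ** 2

def fidelity_mixed(rho, rho_tilde):
    # Eq. 3: general fidelity of two density matrices
    sqrt_rho = sqrtm(rho)
    return np.real(np.trace(sqrtm(sqrt_rho @ rho_tilde @ sqrt_rho))) ** 2

q0 = np.array([1, 0], dtype=complex)               # |0>
qp = np.array([1, 1], dtype=complex) / np.sqrt(2)  # |+>
print(fidelity_pure(q0, qp))                       # 0.5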
2.2 Quantum Circuits
Quantum computers use logic gates, represented by
unitary matrices denoted as U, to transform a quan-
tum state |Φ⟩ into a new state |Φ'⟩ following Eq. 5.
\[ |\Phi'\rangle = U |\Phi\rangle \tag{5} \]
The unitarity aspect of the gates makes all operations
relying on these gates unitary and hence reversible. A
generic unitary matrix working as a quantum mechan-
ical operator on a single qubit can be defined accord-
ing to Eq. 6.
\[ U = e^{i\frac{\gamma}{2}} \begin{pmatrix} \cos\theta\, e^{i\rho} & \sin\theta\, e^{i\phi} \\ -\sin\theta\, e^{-i\phi} & \cos\theta\, e^{-i\rho} \end{pmatrix} \tag{6} \]
In the unitary matrix U, γ denotes a global phase mul-
tiplier, θ characterizes the rotation between compu-
tational basis states, while ρ and φ introduce relative
phase shifts to the diagonal and off-diagonal elements
respectively. Different quantum gates can be created
by selecting specific values for the involved parame-
ters. By leaving some of the involved parameters un-
defined, the design of parameterized quantum gates is
possible. Another characteristic of a quantum gate is
its multiplicity, defining the number of qubits the gate
acts on. Several different gates can then be combined
within a gate set (e.g. the strictly universal Clifford+T
gate set given in Table 1).
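The parameterization of Eq. 6 can be checked numerically; the short NumPy sketch below verifies unitarity and recovers the T gate of the Clifford+T set for a suitable parameter choice. It is included purely for illustration.

import numpy as np

def single_qubit_unitary(gamma, theta, rho, phi):
    # Generic single-qubit unitary built from the parameters of Eq. 6
    return np.exp(1j * gamma / 2) * np.array([
        [np.cos(theta) * np.exp(1j * rho),   np.sin(theta) * np.exp(1j * phi)],
        [-np.sin(theta) * np.exp(-1j * phi), np.cos(theta) * np.exp(-1j * rho)],
    ])

U = single_qubit_unitary(gamma=0.3, theta=0.7, rho=1.1, phi=0.4)
print(np.allclose(U.conj().T @ U, np.eye(2)))  # True: U is unitary

# The T gate corresponds to gamma = pi/4, theta = 0, rho = -pi/8, phi = 0
T = single_qubit_unitary(np.pi / 4, 0.0, -np.pi / 8, 0.0)
print(np.allclose(T, np.diag([1, np.exp(1j * np.pi / 4)])))  # True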
A quantum circuit is then formed by a sequence of
quantum gates acting upon a number of qubits, while
the overall number of gates within the quantum circuit
is called the circuit-depth. Hence the circuit repre-
sents one big transformation matrix, transforming the
incoming quantum state provided by the input qubits
into an altered quantum state obtained on the output
qubits.
Table 1: The gates in the Clifford+T gate set, along with
their corresponding matrix representations and circuit sym-
bols. For the CNOT gate, the example provided illustrates
the scenario where the first qubit acts as the control bit.
Symbol (Name) | Matrix representation | Circuit notation
I (Identity) | $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ | I
H (Hadamard) | $\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ | H
S | $\begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/2} \end{pmatrix}$ | S
CNOT | $\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$ | ●—⊕
T | $\begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}$ | T
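As a small illustration of how a sequence of gates from Table 1 transforms a quantum state, the following PennyLane sketch (the simulation framework used for our environment, see Section 3) prepares a Bell state from |00⟩. It is an example, not code taken from the package.

import pennylane as qml

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def bell_circuit():
    qml.Hadamard(wires=0)   # H on the first qubit
    qml.CNOT(wires=[0, 1])  # CNOT with the first qubit as control
    return qml.state()

print(bell_circuit())  # [0.7071, 0, 0, 0.7071] = (|00> + |11>)/sqrt(2)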
2.3 Quantum State Preparation
Recently much effort has gone into the investigation
of quantum state preparation for different quantum-
based fields like quantum metrology, quantum
sensing, quantum communication, and quantum com-
puting to name just a few examples (Krenn et al.,
2015; Krenn et al., 2021; Mackeprang et al.,
2019). Further exploiting computational resources, approaches facilitating state preparation using ML methods have been pursued. Mackeprang et al. investigated the state preparation of 2-qubit quantum states using RL algorithms, showing their ability to generate Bell states and finding the same solutions as previously discovered by humans (Mackeprang et al., 2019; Zhang et al., 2019).
A different approach towards automated quantum
state preparation was investigated by Gabor et al. (Ga-
bor et al., 2022). Their study involved an approach
training an RL agent to generate quadratic transfor-
mation matrices transforming an initial state into a tar-
get state, both given in the density matrix representa-
tion. To account for unitarity of the transformation,
they utilized a QR-decomposition disassembling the
initially obtained matrix A according to A = U ·R,
while U represents a unitary and R an upper trian-
gular matrix. Subsequently, they used U as the uni-
tary transformation matrix. With this technique they
reached state fidelities of up to > 0.99 for individual
target-states, but struggled with arbitrary state prepa-
ration, resulting in unrealistically long circuits (even
on small 2 qubit circuits) (Gabor et al., 2022).
Their research showed a promising way to ap-
proach the subject of quantum state preparation open-
ing up new possibilities, but also pointed out sev-
eral difficulties. The studies done in this project are
closely connected and partially based on the research
of Gabor et al., dealing with a similar kind of state
preparation problem, while focusing on the possible
improvements in the following. One issue of the ap-
proach chosen by Gabor et al. might be the con-
struction of potentially non-unitary quadratic matri-
ces, leading to costly QR-decomposition scaling with
a complexity between O(n²) and O(n³), if n is the dimension of the quadratic matrix (Parlett, 2000). To
improve this approach, we exclusively utilize unitary
transformations as provided by the Clifford+T gate
set, reducing computational demands and accelerat-
ing agent learning. To simplify the approach used by
Gabor et al. we substituted the density matrix rep-
resentation by a vector representation, reducing the
dimensions of the representation from N² to N, potentially lowering the computational costs.
2.4 Quantum Circuit Optimization and
Synthesis
Since manual optimization is a time-consuming,
error-prone process requiring a high amount of
knowledge, automation of quantum circuit synthe-
sis and optimization is crucial (Fösel et al., 2021).
Consequently, there is a rising focus on employ-
ing machine learning algorithms to tackle this chal-
lenge (Cerezo et al., 2021; Pirhooshyaran and Terlaky,
2021; Ostaszewski et al., 2021; Altmann et al., 2023).
An approach from Fösel et al. aims to optimize arbitrarily generated quantum circuits with regard to their complexity using a CNN approach, yielding an overall depth and gate reduction of 27% and 15% respectively (Fösel et al., 2021). Further, Li et al. im-
plemented an RL-based procedure utilizing a graph-
based framework to represent the structure of a certain
quantum circuit. Their proposed algorithm (QUARL)
then optimizes the respective circuit with regard to its
gate count while maintaining its overall functionality, achieving a gate reduction of around 30%.
However, most of the approaches focus on the opti-
mization of already existing circuits, disregarding their initial synthesis (Li et al., 2023; Xu et al., 2022). To address this issue, our study includes the initial quantum circuit synthesis in the procedure.
3 QUANTUM CIRCUIT
ENVIRONMENT
In this section, we introduce a versatile and scal-
able reinforcement learning environment designed for
quantum circuit synthesis. This environment estab-
lishes a foundational platform for researchers to em-
ploy machine learning in discovering new and effi-
cient quantum circuits for known problems. At each
step, a RL agent can place one quantum gate onto a
circuit, with the aim of crafting a circuit that maps
from an arbitrary initial state to an arbitrary target
state. We formulate the problem at hand as a Markov
decision process M = ⟨S, Γ, P, R, γ⟩, where S is a set of states s_t at time step t, Γ is a set of actions a_t, P(s_{t+1}|s_t, a_t) is the transition probability from s_t to s_{t+1} when executing a_t, r_t = R(s_t, a_t) is a scalar reward, and γ ∈ [0, 1) is the discount factor (Puterman, 2014). For our implementation, we use the PennyLane framework to efficiently simulate quantum circuits. The project is open-source (https://github.com/michaelkoelle/rl-qc-syn), distributed under the MIT license and available as a package on PyPI.
In the following sections we elaborate on the details
of the Quantum-Circuit environment.
3.1 Observation Space
We define an observation in our environment as a real
vector of length 2^{n+2}. Since a normalized n-qubit quantum system can be expressed as a complex vector in 2^n dimensions, we start with two complex vectors describing the current quantum state (v) and the desired target quantum state (v̂), each of length 2^n.
\[ v = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{2^n} \end{pmatrix} \qquad \hat{v} = \begin{pmatrix} \hat{v}_1 \\ \hat{v}_2 \\ \vdots \\ \hat{v}_{2^n} \end{pmatrix} \qquad s = \begin{pmatrix} \mathrm{Re}(v_1) \\ \vdots \\ \mathrm{Re}(v_{2^n}) \\ \mathrm{Im}(v_1) \\ \vdots \\ \mathrm{Im}(v_{2^n}) \\ \mathrm{Re}(\hat{v}_1) \\ \vdots \\ \mathrm{Re}(\hat{v}_{2^n}) \\ \mathrm{Im}(\hat{v}_1) \\ \vdots \\ \mathrm{Im}(\hat{v}_{2^n}) \end{pmatrix} \qquad v_i, \hat{v}_i \in \mathbb{C} \;\; \forall i \in \{1, \dots, 2^n\}, \; s \in S \tag{7} \]
The concatenation of both results in a complex vector of length 2^{n+1}. Splitting up the complex coefficients of the resulting vector into the real and imaginary parts, we end up with the 2^{n+2} real dimensions characterizing the vector s in observation-space S.
With the current state (v) included in s describing the
complete state of the quantum system, the environ-
ment is fully observable. Further, the observation contains the desired target state (v̂) to maintain a consistent target perspective for the RL agent.
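The construction of the observation from Eq. 7 can be sketched in a few lines of NumPy; the helper below is illustrative and does not reflect the exact internals of the package.

import numpy as np

def make_observation(v, v_hat):
    # Flatten current state v and target state v_hat (complex vectors of
    # length 2**n) into the real observation of length 2**(n + 2) from Eq. 7
    return np.concatenate([v.real, v.imag, v_hat.real, v_hat.imag])

# 2-qubit example: current state |00>, target Bell state (|00> + |11>)/sqrt(2)
v = np.array([1, 0, 0, 0], dtype=complex)
v_hat = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
print(make_observation(v, v_hat).shape)  # (16,) = 2**(2 + 2)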
3.2 Action Space
The action space Γ is multi-discrete and defined by
two finite sets of a specific size, out of which one el-
ement is drawn respectively to form an action. While
one set accounts for all gates included in the input
gate set G, the other set represents all possible com-
binations of qubits inputted into the respective gate,
given a total number of n qubits. Eq. 8 calculates the size of the action-space with regards to the given G and n, with n_max being the number of qubits taken by the gate g ∈ G processing the highest number of qubits within the gate set.
\[ \Gamma = [\{0, 1, \dots, |G|\}, \{0, 1, \dots, C\}] \quad \text{with} \quad C = \frac{n!}{(n - n_{\max})!} \tag{8} \]
A proper mapping between the integers in the action-
space and the corresponding gate-qubit combination
is achieved in two steps. The first value serves as an
index for a list representing the gate set, thus selecting
a specific gate. The second value indexes a listing of
all qubit permutations possible, given distinct values for n and n_max. Through this, it is decided which qubit combination the selected gate is applied on. In case the gate takes fewer qubits than present in the respective combination, the gate is simply applied on the first n_g qubits of the permutation, while n_g represents the number of qubits taken by the gate. Implementing
the action space using the two-set architecture ensures
that in a random sampling case, the selection of every
gate and every combination is equally probable. On
the contrary, utilizing just one set including all possi-
ble gate-qubit combinations in the first place, would
lead to an unequal weighting of gate selection. This
happens since gates taking a higher qubit number than
others are applicable to a larger number of different
qubit combinations and thus would appear more often
in the set. Hence we chose the two-set architecture.
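The two-step mapping from a multi-discrete action to a concrete gate-qubit placement can be sketched as follows; the dictionary-based gate description is a placeholder for illustration and not the data structure used by the package.

from itertools import permutations

def decode_action(action, gate_set, n_qubits, n_max):
    # First value selects the gate, second value selects a qubit permutation
    # (cf. Eq. 8, where C = n! / (n - n_max)!)
    gate_idx, perm_idx = action
    gate = gate_set[gate_idx]
    perms = list(permutations(range(n_qubits), n_max))
    qubits = perms[perm_idx][: gate["n_qubits"]]  # only the first n_g qubits
    return gate["name"], qubits

clifford_t = [
    {"name": "I", "n_qubits": 1}, {"name": "H", "n_qubits": 1},
    {"name": "S", "n_qubits": 1}, {"name": "T", "n_qubits": 1},
    {"name": "CNOT", "n_qubits": 2},
]
print(decode_action((4, 1), clifford_t, n_qubits=2, n_max=2))  # ('CNOT', (1, 0))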
3.3 Reward
The quantum circuit environment comes with two dif-
ferent reward functions that are user selectable, a step-
penalty reward (Eq. 9) and a distance-based reward
(Eq. 10).
\[ r_1 = \begin{cases} L - l - 1 & \text{if } 1 - F < \mathrm{SFE} \\ -1 & \text{otherwise} \end{cases} \tag{9} \]
Note that l refers to the depth of the current quan-
tum circuit, L is the maximum circuit depth before
terminating the episode, F references the fidelity from
Eq. 3 and SFE is the standard fidelity error describing
the deviation of the fidelity F from the value 1.
\[ r_2 = \begin{cases} L - l - 1 & \text{if } 1 - F < \mathrm{SFE} \\ -\left\lfloor \frac{L}{2} \right\rfloor \cdot (1 - F) & \text{if } 1 - F \geq \mathrm{SFE} \,\wedge\, L - l = 0 \\ -1 & \text{otherwise} \end{cases} \tag{10} \]
When comparing both equations, it becomes obvious
that they are equivalent apart from the case when the
episodes are finished without reaching the target. In
the step-penalty reward equation, the agent just re-
ceives another −1 penalty, whereas in the distance reward equation, the final penalty is proportional to the distance of the current state to the target state (1 − F).
Using this approach, the agent receives additional in-
formation about the closeness to the target, even if it
was not able to reach it completely. This additional
information may foster the learning procedure of the
agent, especially when more complex targets are in-
volved.
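For illustration, the two reward variants can be written as plain Python functions; the handling of the episode-end case in the distance reward follows our reading of Eq. 10 and should be treated as a sketch rather than the exact implementation.

from math import floor

def step_penalty_reward(F, l, L, sfe=1e-3):
    # Eq. 9: bonus on success, otherwise a constant -1 per step
    return L - l - 1 if 1 - F < sfe else -1

def distance_reward(F, l, L, sfe=1e-3):
    # Eq. 10: as Eq. 9, but the final step of an unsuccessful episode is
    # penalized proportionally to the remaining distance 1 - F
    if 1 - F < sfe:
        return L - l - 1
    if L - l == 0:  # episode finished without reaching the target
        return -floor(L / 2) * (1 - F)
    return -1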
Figure 1: Comparison of two reward techniques: step-
penalty and distance. Each data point represents the average
performance of three runs, trained in a 2-qubit environment
with varied target circuit-depth.
We evaluated both reward functions on 2-qubit-
systems with targets characterized by circuit-depths λ
from 1 to 15, following the target-initialization algo-
rithm (see Section 3.4). We executed 3 runs for ev-
ery setting. Data points were obtained by averaging
the reconstruction circuit-depth Λ (see Section 4.2)
over the last 100 training episodes, with each agent
undergoing identical training steps. Analysis of the
reward technique comparison in Fig. 1 reveals that
while step-penalty and distance rewards exhibit sim-
ilar behaviors, key differences emerge. Specifically,
the step-penalty curve appears smoother and demon-
strates a higher quality, correlating to a 5-20% reduc-
tion in reconstructed circuit-depths Λ at lower λ ∈ {2, 3, 4, 5}. Owing to its robust performance and sta-
bility, especially at lower difficulty targets, the step-
penalty technique was adopted as standard for all sub-
sequent experiments.
3.4 Target State Initialization
When initializing the environment, either a target
quantum state or a circuit-depth λ must be specified.
If a target quantum state is set, a maximum depth
L must also be defined, upon exceeding the current
episode will terminate. We included this option to
provide the possibility to apply agents on specific,
fixed target states.
Figure 2: State diagram displaying the target state gen-
eration algorithm, while gate-list and state-list are imple-
mented as actual lists and λ is the circuit-depth parameter
defining the absolute number of gates, which must be ap-
plied to get to the target. The algorithm starts at the upper
right side of the figure.
If no target parameter is provided, the circuit-
depth parameter λ must be set, enabling the gener-
ation of a random target state per episode, as illus-
trated in Fig. 2. The parameter λ denotes the neces-
sary quantum gate count, applied via a defined algo-
rithm, to reach the target state. To ensure each gate
meaningfully alters the circuit, a change condition is
enforced for every additional gate, permitting its ap-
plication only if the state change satisfies the condi-
tion 1 − F ≥ 0.001, with F corresponding to the fi-
delity. The change condition is then determined be-
tween the state after gate application and every state
previously visited within the initialization procedure
respectively. If one of the conditions is not satisfied, the gate is not applied. By this, ineffective or neutralizing gate applications are avoided. For in-
stance, with the Clifford+T gate set, redundancy can
arise due to the self-inverse nature of Hadamard- and
CNOT-gates. Utilizing this method mitigates cyclical
patterns during target initialization and enhances the
approximation accuracy of the actual minimal gate
count to the provided λ. However, exact equivalence
is not assured. If a permitted gate cannot be applied
after 2 × |G| consecutive tries, where |G| signifies the
gate set cardinality, the generation process is reset to
prevent stagnation at a specific state.
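The generation procedure of Fig. 2 can be sketched as follows; the reduced gate list and the rejection logic are illustrative assumptions rather than the exact implementation of the package.

import numpy as np
import pennylane as qml

GATES = [qml.Hadamard, qml.S, qml.T, qml.CNOT]  # reduced Clifford+T set

def random_target(n_qubits, depth, sfe=1e-3, rng=np.random.default_rng()):
    dev = qml.device("default.qubit", wires=n_qubits)

    def run(ops):
        @qml.qnode(dev)
        def circuit():
            for gate, wires in ops:
                gate(wires=wires)
            return qml.state()
        return circuit()

    while True:  # restart the whole generation on stagnation
        ops, visited, fails = [], [run([])], 0
        while len(ops) < depth:
            gate = GATES[rng.integers(len(GATES))]
            wires = list(rng.choice(n_qubits, size=gate.num_wires, replace=False))
            cand = run(ops + [(gate, wires)])
            # change condition: the new state must differ from every visited state
            if all(1 - abs(np.vdot(s, cand)) ** 2 >= sfe for s in visited):
                ops.append((gate, wires))
                visited.append(cand)
                fails = 0
            else:
                fails += 1
                if fails >= 2 * len(GATES):
                    break
        if len(ops) == depth:
            return visited[-1]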
3.5 Training Loop
Once the environment is initialized, the agent receives
the initial quantum state and target state as first ob-
servation in the format specified in Eq. 7. The agent
now chooses an action in the format specified in Eq. 8
based on the observation. This action corresponds to
a gate applied to a specific combination of qubits. The
environment appends the received action to the list of
previously taken actions. The updated list of actions is
then applied sequentially to form the respective quan-
tum circuit present at the current step. Running the
updated circuit then produces the succeeding obser-
vation equivalent to the next state of the environment.
Following this procedure starting from a specified ini-
tial state (e.g. |00...0⟩), the agent tries to apply a se-
quence of gates in order to get to the defined target
state. This process is displayed in Fig. 3. After the
Figure 3: Schematic of the sequential application of the updated list of actions on the initial state |00...0⟩, transforming it into the current state outputted by the environment.
application of the chosen gate, the number of steps
taken is increased by one, tracked by a step-counter
variable l. In case l reaches the maximal calculation
length L, the episode is aborted. L, if not defined at
the environment's initialization together with a spe-
cific target state, is determined by the otherwise given
circuit-depth parameter λ according to Eq. 11.
\[ L = 2 \cdot \lambda \tag{11} \]
The other case in which the current episode is termi-
nated occurs when the target state is reached. Hence
one episode of the environment can be defined by tak-
ing steps starting from the initial state, either until the
target state is reached or until the number of already
taken steps equals the maximal calculation length L.
When the episode is terminated in case no target state
parameter is set, a new target is generated following
the algorithm described in Fig. 2 prior to the start of
the next episode. Additionally, the current state is re-
set to the initial state and the step counter variable l is
set to 0 when a new episode is started.
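Putting the pieces together, a single episode follows a standard agent-environment interaction loop. The sketch below assumes a Gymnasium-style interface and a hypothetical environment id, both of which may differ from the actual package API.

import gymnasium as gym

# Hypothetical id and keyword arguments; the installed package may expose
# the environment under a different name.
env = gym.make("QuantumCircuit-v0", qubits=2, depth=5)

obs, info = env.reset()                   # initial state and fresh target (Fig. 2)
done = False
while not done:
    action = env.action_space.sample()    # replace with agent.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated        # target reached or l == L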
4 EXPERIMENTAL SETUP
In the following section we go into details about how
we set up our experiments. We explain our choice of
baseline algorithm and propose the reconstructed cir-
cuit depth as an evaluation metric specifically for the
quantum circuit environment. Lastly, we elaborate on
the training procedure and the used hyperparameters.
4.1 Baselines
In order to evaluate our environment designed for
quantum circuit synthesis, we conducted tests using
RL. Inspired by the approach of Gabor et al., we used
Proximal Policy Optimization (PPO) and Advantage
Actor-Critic (A2C) based agents as implemented in
the stable-baselines framework, while adding a ran-
dom agent for comparison purposes. Our aim was to
identify and select the highest-performing algorithm
among these, which was then utilized for our main
experiments.
Our primary baseline, the random agent, serves
as a rudimentary control, making arbitrary selections
from the action space. This agent applies quantum
gates onto the qubits without any informed guidance
or strategy, providing a baseline performance metric
which any proficient strategy should surpass.
Next, we implemented a PPO agent, which opti-
mizes the policy by constraining the new policy to be
close to the old policy (Schulman et al., 2017). After
performing a minor hyperparameter search, particu-
larly focusing on the learning rate, the agent was eval-
uated on a 2-qubit circuit with varying circuit depths
over three runs.
Likewise, A2C was evaluated, a synchronous,
deterministic variant of A3C which uses advantage
functions to reduce the variance of the policy gradient
estimate (Mnih et al., 2016). Following a similar ex-
perimental protocol as with PPO, it was subjected to
a limited hyperparameter search, primarily adjusting
the learning rate, and further evaluated under identical
conditions on the 2-qubit circuit.
PPO distinctly outperformed A2C in synthesizing
2-qubit circuits across varied circuit depths, manifest-
ing more consistent and proficient results over the
three runs. The evaluations were quantitatively as-
sessed based on the circuit-depth λ, ensuring a com-
prehensive appraisal across numerous scenarios and
depths. Henceforth, due to its demonstrable superior-
ity in our preliminary experiments, PPO was selected
as the algorithm of choice for succeeding experiments
and evaluations throughout our research.
4.2 Reconstructed Circuit-Depth Metric
Establishing a performance metric is crucial to main-
tain a general comparability within the RL environ-
ment, particularly when involving diverse agents and
parameterizations. Comparing different runs with
varying configurations is difficult because the rewards
scale with the used parametrization. To address this,
we introduce the reconstructed circuit-depth metric,
designed to normalize rewards with respect to the
configuration, thereby facilitating a straightforward
comparison across runs and giving a much cleaner reading on the actual performance of the agent. By
normalizing the number of gates used by the agent
to recreate the target, n_g, by the circuit-depth λ used for target initialization, a generally valid measure can be designed. This metric then represents how well the agent performed, normalized on the predefined circuit-depth. Defining the maximal calculation length L according to Eq. 11 and extracting the remaining calculation length of the current episode, n_g can easily be derived using Eq. 12.
\[ n_g = L - l \tag{12} \]
Following the intuitive setup of the general measure
mentioned above, we define a metric called the re-
constructed circuit-depth Λ via Eq. 13.
\[ \Lambda[\%] = \frac{n_g}{\lambda} \cdot 100\% \tag{13} \]
Achieving Λ = 100% signifies that the agent has
found a method to recreate the target with equal
circuit-depth as the initial generation algorithm. The
metric’s limit extends to 200%, considering that the
maximum n_g value equals L. Consequently, recon-
structed circuit-depths ranging between 100% and
200% or below 100% indicate respectively longer or
shorter gate sequences found by the agent compared
to the intended gate sequence.
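The metric can be computed directly from the quantities defined above; the following helper is a minimal sketch.

def reconstructed_circuit_depth(L, remaining, lam):
    # Eq. 12: n_g = L - l, with l the remaining calculation length of the
    # episode; Eq. 13: Lambda = n_g / lambda * 100%
    n_g = L - remaining
    return n_g / lam * 100

# A depth-5 target (L = 2 * 5 = 10) solved with 6 gates leaves 4 steps unused
print(reconstructed_circuit_depth(L=10, remaining=4, lam=5))  # 120.0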
5 TRAINING AND
HYPERPARAMETER
OPTIMIZATION
To assure a robust and replicable training process,
each model configuration was trained under consis-
tent conditions on a Slurm cluster, utilizing Intel(R)
Core(TM) i5-4570 CPU @ 3.20GHz and Nvidia 2060
and 1050 GPUs, for a substantial total of 1,700,000
steps (56,667–1,700,000 episodes, depending on the
settings) per run. Three different seeds, namely
{1, 2,3}, were used for all experiments to obtain con-
sistent and reliable results, outputting an average and
a standard deviation for each data point. If not further
specified, we use a 2-qubit system with a circuit depth of 5 and an SFE of 0.001. Furthermore, we use an initial state of |00...0⟩, since it is the ground state of
most quantum hardware and thus a prominent start-
ing point (Schneider et al., 2022; Kaye et al., 2007;
Blazina et al., 2005).
A hyperparameter search was conducted,
focusing on the learning rate, using a grid
search approach across the following candidates:
{0.00001, 0.0001, 0.0003, 0.0005, 0.0007, 0.001, 0.01}.
For the Proximal Policy Optimization (PPO) al-
gorithm, a learning rate of 0.001 was identified as
optimal and was subsequently utilized in all related
experiments. Additionally, the clipping parameter
for PPO was set to 0.2 to ensure stable and reliable
policy updates. On the other hand, the Advantage
Actor-Critic (A2C) algorithm demonstrated optimal
performance with a learning rate of 0.00001.
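Such a grid search can be expressed compactly with stable-baselines3; the environment factory make_env below is a hypothetical stand-in for constructing the environment of Section 3 and is not part of the actual API.

from stable_baselines3 import PPO

LEARNING_RATES = [0.00001, 0.0001, 0.0003, 0.0005, 0.0007, 0.001, 0.01]

for lr in LEARNING_RATES:
    for seed in (1, 2, 3):
        env = make_env(qubits=2, depth=5, sfe=0.001)  # hypothetical factory
        model = PPO("MlpPolicy", env, learning_rate=lr, clip_range=0.2, seed=seed)
        model.learn(total_timesteps=1_700_000)
        model.save(f"ppo_lr{lr}_seed{seed}")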
This meticulous approach to training and hyper-
parameter optimization lays a solid foundation for
the subsequent experiments, ensuring that the derived
results and insights are both reliable and grounded
in a systematic exploration of the model’s parameter
space.
6 RESULTS
A main objective of our project was the implemen-
tation of a RL environment capable of training RL
agents on the DQCS problem. Another one was to
study the DQCS itself. Therefore, we now conduct
a variety of experiments to examine the task. For
the setup used in the following experiments, see Sec-
tion 4.
A comparison of PPO and A2C used with the re-
spective optimized settings showed a clear superior-
ity of the PPO algorithm. We evaluated the last 100
episodes of the respective runs of the agents on 2-
qubit targets with circuit-depth λ = 5. The recon-
structed circuit-depth Λ reached by the PPO agents
was 112.5 ± 35.3%, while the A2C agents produced a Λ of 195.3 ± 6.5%. In comparison, the random baseline, corresponding to an untrained agent, exhibited a Λ = 199.6 ± 2.8%. Due to the better performance of PPO, it was selected for all following experiments.
6.1 Qubit Circuit-Depth Relationship
In order to explore the complexity of the DQCS prob-
lem, we investigated the training of PPO-agents on
various systems differing in their qubit numbers and
circuit-depths λ. We trained agents on systems of
2, 3, 4, 5, and 10 qubits utilizing circuits of depths
λ from 1 to 15 respectively. We obtained the data
displayed in Fig. 4 by averaging the reconstructed
circuit-depth Λ of the respective trained agents over
the last 100 episodes of the training run. Analyz-
Figure 4: Agents’ performances on targets with different
qubit numbers n (2, 3, 4, 5, 10) while the circuit-depth λ
is varied (1-15). Every data point represents the average
performance of 3 agents trained on systems possessing the
respective n and λ settings.
ing the curves, two main trends become apparent. The
first observed trend is the increasing reconstructed
circuit-depth (Λ) as the target circuit-depth (λ) rises,
resembling a sigmoid curve. A higher target circuit-
depth (λ) naturally demands more intricate solutions
from the agent for their preparation. Starting from the
left side of the plot, the reconstructed circuit-depth
(Λ) ranges from 100% to 140% indicating scenarios
where the agent’s solutions resemble the paths used
for target initialization. The curves exhibit a linear
rise before eventually asymptotically converging to-
wards a Λ of 200%. This indicates the agent’s inabil-
ity to recreate the target at this level of complexity.
The second trend involves a decrease in agent per-
formance as the qubit number (n) increases, reflect-
ing the increased complexity of the target state. This
is evident from the steeper rise in the reconstructed
circuit-depth (Λ) occurring at lower λ-values for tar-
gets with higher n compared to those with lower n. Of
particular interest is the situation at λ = 3 and n = 2
(represented by the blue curve), where the curve tan-
gentially intersects the 100% mark. This scenario
offers two possible explanations. First, it could im-
ply that the agent perfectly recreates every target us-
ing an equivalent circuit-depth as applied during ini-
tialization. Alternatively, it suggests that the agent
achieves the desired states with a lower circuit-depth
and, consequently, fewer quantum gates than initially
used to create the targets. Consequently, the aver-
age reconstruction circuit-depth (Λ) can reach 100%
or even drop below it. This observation, considering
the advanced algorithm used for target initialization,
provides evidence of the high level of optimization
achieved by the trained agents.
6.2 Benchmarking Analysis
In the following section, we set up a benchmarking
framework for the quality validation of the trained
agents and for the comparison of different RL algo-
rithms on the DQCS problem. We focus on the exam-
ination of comparatively simple systems for bench-
marking, involving 2-qubit targets only. To facilitate
the benchmarking, we conducted two different ap-
proaches. The first method measures the performance
of an agent on a set of randomly generated targets of a
certain circuit-depth λ, the other applies the agent on
a set of specific well-known target states.
6.2.1 Evaluation Levels
The difficulty of the target circuit is dependent on two
different factors, the number of qubits and the circuit-
depth λ used for target initialization. However, setting
the qubit number to 2 leaves us with λ as the only pa-
rameter. Based on the data obtained in Section 6.1,
we are able to propose a splitting of the parameter-
space of λ ∈ [1, 15] into three evaluation levels ’easy’,
’medium’, and ’hard’. According to Fig. 4, the first
segment describes a development varying around a re-
constructed circuit-depth of 110%. The second region
can be defined as an interval of linear rise and the last
section is characterized by an asymptotic convergence
against a reconstructed circuit-depth of 200%. Table 2
contains the definition of the derived levels.
Table 2: The definition of the three evaluation levels ’easy’,
’medium’ and ’hard’ within the circuit-depth λ interval of
[1, 15].
Level Set of included circuit-depths λ
easy {1, 2, 3, 4, 5}
medium {6, 7, 8, 9, 10}
hard {11, 12, 13, 14, 15}
The data displayed in Fig. 5 was obtained by ap-
plying agents trained on 2-qubit targets with different
circuit-depths λ, ranging from 1 to 15 on 100 ran-
dom targets of the respective evaluation level, while
setting all other parameters to the standard values.
For this experiment, the upper bound of possible ap-
plied actions, corresponding to the maximal calcula-
tion length L, was set to 30 steps. It’s evident that each
segment has been successfully solved by at least one
agent with an average gate number below the maxi-
mum step count. When examining the bar chart, sev-
eral trends become apparent. Firstly, there is a clear
correlation between evaluation levels and the average
gate number required for target preparation. Targets
categorized as ’hard’ typically demand more gates, on
average, for their preparation compared to those cate-
gorized as ’medium’ or ’easy’. This trend arises from
the general need for more gates when dealing with
higher circuit-depth (λ) targets. Another noteworthy
observation pertains to the absence of a shift in peak
performance towards agents trained on high circuit-
depth (λ) targets when transitioning from ’easy’ to
’medium’ and ’hard’ evaluation levels. Intuitively,
one might expect such a shift, given that changing
the evaluation level involves applying agents to target
states with different average circuit-depths (λ). Con-
sequently, it would be reasonable to anticipate a shift
of the best-performing agents towards those trained
on a circuit depth (λ) close to the average circuit-
depth of the respective evaluation level. However, this
expected correlation seems to be lacking, indicating a
weak dependency between these variables.
Figure 5: The average number of gates n_g applied by PPO-
agents trained on different circuit-depths λ, when applied
on 100 random targets from the evaluation levels ’easy’,
’medium’ and ’hard’ respectively.
The experiment showed a high similarity of the
best-performing agents for the respective evaluation
levels. Specifically, agents trained with λ values
ranging from 4 to 7 exhibited notable performance.
Agents with λ values of 6 and 7 consistently ranked
within the top 3 across all regions, while λ = 5 ap-
peared in the top list of 2 sections. This shared
performance trend may be attributed to the initial
optimization, which emphasized agents with similar
settings. For the subsequent benchmarking of well-
known states, we chose the best-performing agents on
the respective evaluation levels (agents trained with
λ = 4, 5, and 6) as our candidates.
6.2.2 Reconstructing Well-Known States
As mentioned earlier, this benchmarking method again focuses on 2-qubit systems only. We composed the set out of states well-known in the quantum community, including the four basis states of the 2-qubit state-space |00⟩, |01⟩, |10⟩ and |11⟩, the uniform superposition state 1/2(|00⟩ + |01⟩ + |10⟩ + |11⟩) and the four 2-qubit Bell states 1/√2(|00⟩ + |11⟩), 1/√2(|00⟩ − |11⟩), 1/√2(|01⟩ + |10⟩) and 1/√2(|01⟩ − |10⟩). We divided the chosen benchmark states into subgroups representing different levels of evaluation, according to the minimal circuit-depth necessary to prepare the respective states. We obtained the minimal sequences starting from the ground state |00⟩, using a brute-force search algorithm trying all possible combinations of gates included in the Clifford+T gate set. According to the minimal circuit-depth, we divided the set into three subgroups (easy: 0-2 gates, medium: 3-4 gates, hard: ≥ 5 gates). The results of this classification are displayed in Table 3. Subsequently, we
Table 3: All states included in the 2-qubit set, divided into
subgroups of different evaluation levels, listed together with
the minimal number of quantum gates necessary for their
preparation using only gates contained in the Clifford+T
gate set.
State | Minimal circuit-depth | Level
|00⟩ | 0 | easy
1/2(|00⟩ + |01⟩ + |10⟩ + |11⟩) | 2 | easy
1/√2(|00⟩ + |11⟩) | 2 | easy
|01⟩ | 4 | medium
|10⟩ | 4 | medium
1/√2(|00⟩ − |11⟩) | 4 | medium
|11⟩ | 5 | hard
1/√2(|01⟩ + |10⟩) | 5 | hard
1/√2(|01⟩ − |10⟩) | 7 | hard
tested agents trained with λ = 4, 5, and 6 on the de-
signed set of states. The outcomes of this investiga-
tion are displayed in Fig. 6, while the targets are or-
dered according to the rise in their minimal required
circuit-depth from left to right. An important observation from the displayed data is the presence of large performance variations between the agents when
applied to certain targets. Further, the quality of the
agents themselves varies strongly for different tar-
gets. This indicates a form of specialization of certain
agents on specific targets.
For the state |00⟩, which is equivalent to the initial
state, all tested agents struggle to solve the task. It’s
essential to note that the environment doesn’t check if
the initial state matches the target initially. Hence, in
cases where the target is already reached from the be-
ginning, the agent must apply a gate that preserves the
state, such as S-, T-, or an identity gate on any qubit.
Despite these options comprising more than half of
the available action-space, they are rarely selected, re-
sulting in high n
g
-values. This behavior is explained
by the fact that agents are typically trained on targets
different from the initial state, requiring gates that
modify the current observation. Moreover, due to the
target initialization algorithm’s implementation,
|00⟩ is never a target during training when λ ≠ 0. Conse-
quently, agents are explicitly trained to avoid applying
gates that would be useful in this scenario, leading to
suboptimal performance on this specific task.
Figure 6: Performance of PPO-agents trained on circuit-
depths λ = 4,5 and 6 which are applied on the 9 states in-
cluded in the well-known state set given in Table 3.
The λ = 4 agents, being trained on a comparatively low circuit-depth λ, show the best results for relatively easy states like 1/2(|00⟩ + |01⟩ + |10⟩ + |11⟩) and 1/√2(|00⟩ + |11⟩), while lacking in performance when applied to the targets of a higher minimal circuit-depth. On the other hand, agents trained with λ = 5 and 6 are able to perform better on the targets of a higher minimal circuit-depth, showing strong performance in general. For the state 1/√2(|01⟩ − |10⟩), possessing the highest minimal circuit-depth, all agents show low potential. To analyze the generated solu-
tions in more detail, the following Table 4 contains
the most promising quantum circuits created within
this benchmark test. We obtained these circuits by
identifying the λ ∈ {4, 5, 6} which showed the lowest n_g for a particular target and subsequently deter-
mined the top-performing agent with the respective
λ. The most frequently generated circuit produced by
this agent was then extracted for the respective tar-
get state. Since the states |00⟩ and 1/√2(|01⟩ − |10⟩)
lead to a wide distribution of created circuits without
a clear favorite, they were excluded from this analy-
sis. The ’Generation probability’ describes the likeli-
hood of occurrence of the specific circuit in percent.
Considering the gates included in the displayed solu-
tions, while bearing in mind the minimal circuit-depth
(see Table 3), it becomes apparent that for all targets
a minimal circuit was found. Further, designs like the
circuit created for 1/√2(|00⟩ + |11⟩) are known in the literature as well (Schneider et al., 2022). The generation
probabilities of the results show that the agents strictly
specialized in the preparation of specific circuits, ob-
taining values of 96-100% for all but one state. Con-
clusively it can be stated that the selected agents per-
form sufficiently on real-world examples as included
in the set, creating highly optimized quantum circuits
which indeed prepare the desired targets.
7 CONCLUSION
In this work, we describe the successful implemen-
tation of a RL environment, enabling the training of
RL agents on the DQCS problem. Relying on the uti-
lization of gates from the Clifford+T gate set only, we
facilitate the direct transfer of the synthesized quan-
tum circuits onto real quantum devices. While ex-
ploring the parameter-space of the DQCS problem,
we demonstrated sufficient task-solving capabilities
of the trained agents, regarding DQCS in a wide vari-
ety of settings. The tested parameter-space spanned
different targets including systems exhibiting qubit
counts of 2, 3, 4, 5, and 10, as well as circuits-depths
ranging from 1 to 15 used for target initialization.
Through the investigation of the agents’ behavior on
different DQCS systems, we discovered correlations
between the target state parameters, qubit-count and
circuit-depth, and the agents' reconstructed circuit-
Table 4: States included in the 2-qubit benchmark set listed together with the most probable circuit designs created by the
best-ranked agents.
State | Generated circuit | Gen. probability [%]
1/2(|00⟩ + |01⟩ + |10⟩ + |11⟩) | (circuit diagram) | 100
1/√2(|00⟩ + |11⟩) | (circuit diagram) | 100
|01⟩ | (circuit diagram) | 96
|10⟩ | (circuit diagram) | 98
1/√2(|00⟩ − |11⟩) | (circuit diagram) | 76
|11⟩ | (circuit diagram) | 99
1/√2(|01⟩ + |10⟩) | (circuit diagram) | 96
depth. It became evident that the reconstructed circuit-depth increases if the qubit count or the circuit-depth is raised. We found 2-qubit targets to be solvable by
the trained agents for a variety of different circuit-
depths. However, the preparation of targets charac-
terized by a qubit number > 2 still poses challenges.
Future efforts could further optimize hyperparam-
eters and refine the agent’s training target-space to
boost RL network performance. We also consider
adopting a curriculum learning or a GAN approach,
with a generative network as the learner and a dis-
criminator for target setting. For quantum computers
using non-Clifford+T gate sets, we will integrate parameterized gates to devise advanced quantum circuits. This, however, demands further adaptation of the RL agent. Beyond studying the DQCS problem, we have established benchmarks for comparing RL algorithms. Our trained PPO agents performed well on the discrete benchmark tasks, revealing optimal circuit designs. We plan to expand these benchmarks to systems with more than 2 qubits using k-means clustering.
In summary, our findings demonstrate the appli-
cability and potential of reinforcement learning in ad-
dressing the DQCS problem, highlighting the need
for further research. Our approach represents a sig-
nificant step towards fully automated quantum circuit
synthesis, showcasing the effectiveness of RL meth-
ods in tackling this challenge.
ACKNOWLEDGEMENTS
This work is part of the Munich Quantum Valley,
which is supported by the Bavarian state government
with funds from the Hightech Agenda Bayern Plus.
REFERENCES
Altmann, P., Bärligea, A., Stein, J., Kölle, M., Gabor, T.,
Phan, T., and Linnhoff-Popien, C. (2023). Challenges
for reinforcement learning in quantum computing.
Barenco, A., Bennett, C. H., Cleve, R., DiVincenzo,
D. P., Margolus, N., Shor, P., Sleator, T., Smolin,
J., and Weinfurter, H. (1995). Elementary gates for
quantum computation. Phys.Rev. A52 (1995) 3457,
52(5):3457–3467.
Bennett, C. H. and Wiesner, S. J. (1992). Communica-
tion via one- and two-particle operators on einstein-
podolsky-rosen states. Phys. Rev. Lett., 69:2881–
2884.
Blazina, D., Duckett, S. B., Halstead, T. K., Kozak, C. M.,
Taylor, R. J. K., Anwar, M. S., Jones, J. A., and
Carteret, H. A. (2005). Generation and interrogation
of a pure nuclear spin state by parahydrogen-enhanced
nmr spectroscopy: a defined initial state for quantum
computation. Magn. Reson. Chem., 43(3):200–208.
Cerezo, M., Arrasmith, A., Babbush, R., Benjamin, S.,
Endo, S., Fujii, K., McClean, J., Mitarai, K., Yuan,
X., Cincio, L., and Coles, P. J. (2021). Variational
quantum algorithms. Nat Rev Phys, 3:625–644.
Ekert, A. K. (1991). Quantum cryptography based on bell’s
theorem. Phys. Rev. Lett., 67(6).
Farhi, E. and Neven, H. (2018). Classification with quan-
tum neural networks on near term processors. arXiv:
Quantum Physics.
Fösel, T., Niu, M. Y., Marquardt, F., and Li, L. (2021).
Quantum circuit optimization with deep reinforce-
ment learning.
Gabor, T., Zorn, M., and Linnhoff-Popien, C. (2022). The applicabil-
ity of reinforcement learning for the automatic gener-
ation of state preparation circuits. GECCO ’22: Pro-
ceedings of the Genetic and Evolutionary Computa-
tion Conference Companion, page 2196–2204.
Jozsa, R. (1994). Fidelity for mixed quantum states. Journal
of Modern Optics, 41(12):2315–2323.
Kaye, P., Laflamme, R., and Mosca, M. (2007). An Introduction
to Quantum Computing. Oxford University Press.
Krenn, M., Kottmann, J. S., Tischler, N., and Aspuru-
Guzik, A. (2021). Conceptual understanding through
efficient automated design of quantum optical experi-
ments. Phys. Rev. X, 11.
Krenn, M., Malik, M., Fickler, R., Lapkiewicz, R., and
Zeilinger, A. (2015). Automated search for new quan-
tum experiments. Phys. Rev. Lett. 116, 090405 (2016),
116(9):090405.
Li, Z., Peng, J., Mei, Y., Lin, S., Wu, Y., Padon, O., and Jia,
Z. (2023). Quarl: A learning-based quantum circuit
optimizer.
Mackeprang, J., Dasari, D. B. R., and Wrachtrup, J.
(2019). A reinforcement learning approach for quan-
tum state engineering. Quantum Machine Intelligence
2, 1(2020), 2(1).
Mansky, M. B., Castillo, S. L., Puigvert, V. R., and
Linnhoff-Popien, C. (2022). Near-optimal circuit con-
struction via cartan decomposition.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap,
T. P., Harley, T., Silver, D., and Kavukcuoglu, K.
(2016). Asynchronous methods for deep reinforce-
ment learning. ICML 2016.
Nielsen, M. A. and Chuang, I. L. (2010). Quantum com-
putation and quantum information. 10th anniversary
edition. Cambridge: Cambridge University Press.
Ostaszewski, M., Trenkwalder, L. M., Masarczyk, W.,
Scerri, E., and Dunjko, V. (2021). Reinforcement
learning for optimization of variational quantum cir-
cuit architectures. Advances in Neural Information
Processing Systems (NeurIPS 2021), 34.
Parlett, B. (2000). The qr algorithm. Computing in Science
& Engineering, 2(1):38–42.
Peruzzo, A., McClean, J., Shadbolt, P., Yung, M.-H., Zhou,
X.-Q., Love, P. J., Aspuru-Guzik, A., and O’Brien,
J. L. (2013). A variational eigenvalue solver on a
quantum processor. Nature Communications, 5:4213,
(2014), 5(1).
Pirhooshyaran, M. and Terlaky, T. (2021). Quantum circuit
design search. Quantum Machine Intelligence, 3(25).
Puterman, M. L. (2014). Markov decision processes: dis-
crete stochastic dynamic programming. John Wiley &
Sons.
Schneider, S., Burgholzer, L., and Wille, R. (2022). A sat
encoding for optimal clifford circuit synthesis. ASP-
DAC ’23: Proceedings of the 28th Asia and South Pa-
cific Design Automation Conference, page 190–195.
Schuld, M., Bocharov, A., Svore, K., and Wiebe, N. (2020).
Circuit-centric quantum classifiers. Phys. Rev. A, 101.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization
algorithms. International Conference on Machine
Learning (ICML).
Tóth, G. and Apellaniz, I. (2014). Quantum metrology from
a quantum information science perspective. J. Phys.
A: Math. Theor., 47(42).
Xu, M., Li, Z., Padon, O., Lin, S., Pointing, J., Hirth, A.,
Ma, H., Palsberg, J., Aiken, A., Acar, U. A., and
Jia, Z. (2022). Quartz: Superoptimization of quan-
tum circuits. Association for Computing Machinery,
43:625–640.
Zhang, X.-M., Wei, Z., Asad, R., Yang, X.-C., and Wang,
X. (2019). When does reinforcement learning stand
out in quantum control? a comparative study on state
preparation. npj Quantum Inf, 5(85).