Optimizing 2D+1 Packing in Constrained Environments Using Deep Reinforcement Learning

Victor Ulisses Pugliese 1,a, Oséias Faria de Arruda Ferreira 2,b and Fabio A. Faria 1,3,c

1 Universidade Federal de São Paulo, São José dos Campos, São Paulo, Brazil
2 EMBRAER S.A., São José dos Campos, São Paulo, Brazil
3 Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, Lisboa, Portugal

a https://orcid.org/0000-0001-8033-6679
b https://orcid.org/0009-0007-4066-8709
c https://orcid.org/0000-0003-2956-6326
Keywords: Deep Reinforcement Learning, Packing, PPO, A2C.

Abstract: This paper proposes a novel approach based on deep reinforcement learning (DRL) for the 2D+1 packing problem with spatial constraints. This problem is an extension of the traditional 2D packing problem, incorporating an additional constraint on the height dimension. Therefore, a simulator using the OpenAI Gym framework has been developed to efficiently simulate the packing of rectangular pieces onto two boards with height constraints. Furthermore, the simulator supports multi-discrete actions, enabling the selection of a position on either board and the type of piece to place. Finally, two DRL-based methods (Proximal Policy Optimization - PPO and Advantage Actor-Critic - A2C) have been employed to learn a packing strategy and demonstrate their performance compared to a well-known heuristic baseline (MaxRect-BL). In the experiments carried out, the PPO-based approach proved to be a good solution for solving complex packing problems and highlighted its potential to optimize resource utilization in various industrial applications, such as the manufacturing of aerospace composites.
1 INTRODUCTION
Manufacturing has undergone significant changes in
recent decades, primarily driven by market trends that
encourage companies to transition from traditional
mass production lines to more dynamic and flexible
manufacturing systems, essential for competitiveness
in the global market. This shift, known as smart man-
ufacturing, is currently reinventing itself through ad-
vances in Digital Transformation, Internet of Things
(IoT), and Artificial Intelligence (AI) (Alemão et al., 2021), (Xia et al., 2021), and (Ramezankhani et al.,
2021).
Consequently, various approaches to manufactur-
ing scheduling have been studied and implemented to
optimize production and resource allocation. Despite
these efforts, most scheduling uses manual methods
or basic software, resulting in limited improvements
in system performance. Historically, the production
lines produced many of the same products, always
following the same process. However, this is not the case for Smart Manufacturing (Alemão et al., 2021).
Aerospace manufacturing, particularly using com-
posite materials, presents a complex scheduling chal-
lenge characterized by high demand variability, ex-
tended lead times, and the integration of diverse sup-
pliers and work practices. Although composites of-
fer advantages such as superior strength, corrosion
resistance, and efficient forming, their higher cost compared to traditional metallic materials requires careful optimization (Xie et al., 2020) and (Azami et al., 2018).
The manufacturing process typically involves two pri-
mary stages: layup and curing (Azami, 2016). Au-
toclave packing, a critical aspect of the curing pro-
cess, involves meticulous placement of composite
parts within the autoclave to achieve desired prod-
uct properties (Haskilic et al., 2023) and (Elkington
et al., 2015). This intricate task, involving manual po-
sitioning, presents a unique optimization problem that
surpasses the classical packing problem due to addi-
tional constraints and resource management require-
ments (Collart, 2015).
Certain constraints can be relaxed to simplify the
optimization process. For instance, since composite
materials cannot be stacked within an autoclave, the
placement strategy can focus on the width and length
of the parts. Additionally, the height of each part must
be verified to ensure it does not exceed the capacity of
the tooling cart.
The introduction of Reinforcement Learning (RL)
methods to solve packing problems has shown
promising results in the literature. For instance,
(Kundu et al., 2019) employed RL to take an image
as input and predict the pixel position of the next
box, while (Li et al., 2022) explored RL in 2D and
3D environments. Furthermore, combining heuristics
with RL, as in (Fang et al., 2023a), has proven to
be effective, and RL has also been applied to several other types of problems, as discussed in (Wang
et al., 2022). One of the advantages of RL is that
it does not require an explicit model of the environ-
ment; the agent learns to make decisions by observ-
ing the rewards of its actions from a state, as described
in (Sutton and Barto, 2018), and continuously adapts
to its environment through exploration and exploita-
tion. This makes RL particularly suitable for sequen-
tial decision-making in games, robotics, control sys-
tems, and scheduling problems (Cheng et al., 2021).
Our approach distinguishes itself by relying solely on RL methods, using actor-critic algorithms to balance exploration and exploitation. This contrasts with other packing studies that
frequently incorporate heuristics to guide or direct the
RL algorithm, thereby limiting its scope and creativ-
ity. To our knowledge, no scientific study has ever
addressed this topic in the literature. Therefore, this
paper aims to apply Reinforcement Learning meth-
ods to address a 2D+1 packing problem with spatial
constraints. This problem is an extension of the tra-
ditional 2D packing problem, incorporating an addi-
tional constraint on the height dimension. We also
compare PPO and A2C, the only methods among those considered that support multi-discrete action spaces. This research, inspired by the challenges of aerospace composite manufacturing, has potential applications in many industrial sectors, including the packing of components in vehicles, organizing parts in boxes or pallets for transport and storage, and arranging products in store displays.
2 BACKGROUND
This section briefly describes the types of packing problems and the deep reinforcement learning (DRL)
methods used in this paper.
2.1 Packing
The packing problem is a classic challenge in combi-
natorial optimization that has been extensively stud-
ied for decades by researchers in operations research
and computer science, as noted in (Li et al., 2022).
The primary objective is to allocate objects within containers while minimizing wasted space. The problem can involve regular shapes (Kundu et al., 2019) and (Zhao et al., 2022b) or irregular shapes (Crescitelli and Oshima, 2023), and it is often explored in streaming/online or batching/offline settings.
Several works based on heuristic approaches have been proposed for solving packing problems, as described in (Oliveira et al., 2016), such as Maximum Rectangles - Bottom-Left (MaxRect-BL), Best-Fit Decreasing Height (BFDH), and Next-Fit Decreasing Height (NFDH). The MaxRect-BL approach places the largest rectangle in the nearest available bottom-left corner of a 2D space (Fang et al., 2023a). BFDH sorts items by descending height and then attempts to place each item, left-justified, on the existing level with the minimum remaining horizontal space (Seizinger, 2018). The NFDH approach first arranges the pieces in descending order of height and then places each piece on the current level, starting from the left side, as long as there is enough space; otherwise, it starts a new level (Oliveira et al., 2016).
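To make the level-based idea concrete, the sketch below gives a minimal NFDH implementation in Python; the piece representation as (width, height) tuples and the example strip width are illustrative assumptions, not part of the heuristics' original descriptions.

```python
# Minimal sketch of Next-Fit Decreasing Height (NFDH) for a fixed-width strip.
# Pieces are (width, height) tuples; the strip width is an assumed example value.

def nfdh(pieces, strip_width):
    """Return a list of (x, y, w, h) placements using the NFDH heuristic."""
    placements = []
    # 1. Sort pieces by decreasing height.
    pieces = sorted(pieces, key=lambda p: p[1], reverse=True)
    level_y = 0        # bottom y-coordinate of the current level
    level_height = 0   # height of the tallest piece on the current level
    cursor_x = 0       # next free x-position on the current level
    for w, h in pieces:
        if cursor_x + w > strip_width:
            # 2. Not enough horizontal space left: open a new level above.
            level_y += level_height
            cursor_x = 0
            level_height = 0
        placements.append((cursor_x, level_y, w, h))
        cursor_x += w
        level_height = max(level_height, h)
    return placements

if __name__ == "__main__":
    print(nfdh([(2, 2), (2, 1), (1, 2), (2, 2)], strip_width=4))
```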
2.2 Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) addresses the
challenge of autonomously learning optimal decisions
over time. Although it employs well-established su-
pervised learning methods, such as deep neural net-
works for function approximation, stochastic gradi-
ent descent (SGD), and backpropagation, RL applies
these techniques differently, without a supervisor, us-
ing a reward signal and delayed feedback. In this
context, an RL agent receives dynamic states from an
environment and takes actions to maximize rewards
through trial-and-error interactions (Kaelbling et al.,
1996).
The agent and the environment interact in a sequence of discrete time steps, t = 0, 1, 2, 3, ... At each time step t, the agent receives a representation of the environment's state s_t ∈ S, where S is the set of possible states, and selects an action a_t ∈ A(s_t), where A(s_t) is the set of actions available in state s_t. At time step t + 1, as a consequence of its action, the agent receives a numerical reward r_{t+1} ∈ R and transitions to a new state s_{t+1} (Sutton and Barto, 2018).
During each iteration, the agent implements a mapping from states to the probabilities of each possible action. This mapping, known as the agent's policy, is denoted π_t, where π_t(s, a) represents the probability that a_t = a given s_t = s. Reinforcement learning methods specify how the agent updates its policy based on experience, with the goal of maximizing the cumulative reward over the long term (Sutton and Barto, 2018).
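As an illustration of this interaction loop, the short Python sketch below runs one episode with the Gymnasium API; the environment ID and the random policy are placeholders for demonstration only, not part of our method.

```python
# Minimal sketch of the agent-environment loop described above, using the
# Gymnasium API. "CartPole-v1" is only a stand-in environment, and the policy
# here is a random placeholder rather than a learned one.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)            # s_0
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()     # a_t (random placeholder for pi_t)
    state, reward, terminated, truncated, info = env.step(action)  # s_{t+1}, r_{t+1}
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"episode return: {total_reward}")
```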
2.2.1 Proximal Policy Optimization (PPO)
PPO employs the actor-critic method and trains on-
policy, meaning it samples actions based on the most
recent policy iteration (Schulman et al., 2017). In
this framework, two neural networks typically serve
as the “actor” and “critic.” The “actor” learns the pol-
icy, while the “critic” estimates the value function or
the advantage, which is used to train the “actor”.
The training process involves calculating future
rewards and advantage estimates to refine the policy
and adjust the value function. Both the policy and
value function are optimized using stochastic gradi-
ent descent algorithms, as described in (Keras, 2022).
The degree of randomness in action selection depends on the initial conditions and the training procedure. Typically, as training progresses, the policy becomes less random due to updates that encourage the exploitation of previously discovered rewards (Sáenz Imbacuán, 2020).
2.2.2 Advantage Actor-Critic (A2C)
A2C, often perceived as a distinct algorithm, is shown in “A2C is a special case of PPO” to be a specific configuration of Proximal Policy Optimization (PPO) operating within the actor-critic approach. A2C
shares similarities with PPO in employing separate
neural networks for policy selection (actor) and value
estimation (critic). Its core objective aligns with PPO
when the latter’s update epochs are set to 1, effec-
tively removing the clipping mechanism and stream-
lining the learning process (Huang et al., 2022).
A2C is a synchronous adaptation of the Asyn-
chronous Actor-Critic (A3C) policy gradient ap-
proach. It operates deterministically, waiting for ev-
ery actor to complete its experience segment before
initiating updates, averaging across all actors. This
strategy improves GPU utilization by accommodating
larger batch sizes (Mnih et al., 2016).
3 RELATED WORKS
The field of 2D regular packing problems has seen
significant progress in recent years, with various
methods proposed to optimize space utilization and
minimize waste, using Reinforcement Learning. This
review connects several key research papers, high-
lighting the diverse strategies to tackle these chal-
lenges.
In online 2D bin packing, where items are placed
sequentially without prior knowledge of future in-
puts, (Kundu et al., 2019) propose a variation of DQN
for the 2D online bin packing problem, to maximize
packing density. This method takes an image of the
current bin state as input and determines the precise
location for the next object placement. The reward
function encourages placing objects in a way that
maximizes space for future placements. The method
is extendable to 3D online bin-packing problems.
For grouped 2D bin packing, common in indus-
tries like furniture manufacturing and glass cutting,
where orders are divided into groups and optimized
within each group, (Ao et al., 2023) presents a hierar-
chical reinforcement learning approach. The method
was successfully developed in a Chinese factory, re-
ducing the raw material costs. (Li et al., 2022) pro-
poses SAC with a recurrent attention encoder to cap-
ture inter-box dependencies and a conditional query
decoder for reasoning about subsequent actions in 2D
and 3D packing problems. This approach demon-
strates superior space utilization compared to base-
lines, especially in offline and online strip packing
scenarios.
To address uncertainties in real-world packing
problems, (Zhang et al., 2022) presents a hybrid
heuristic algorithm that combines enhanced scoring
rules with a DQN, which dynamically selects heuris-
tics through a data-driven process, to solve the truck
routing and online 2D strip packing problem.
Other works combine RL with scoring rules. (Zhao et al., 2022b), for instance, employed Q-learning for sequencing and the bottom-left centroid rule for positioning. Fang et al. (Fang et al., 2023a) leveraged REINFORCE with the MaxRect-BL algorithm to exploit underlying packing patterns. The Reinforcement Learning-based Simple Random Algorithm (RSRA) of (Zhu et al., 2020), which integrates skyline-based scoring rules with a DQN, has also demonstrated effectiveness.
This section shows a range of RL methods ap-
plied to 2D regular packing problems. As research
in this area advances, there is also an increasing focus
on expanding 3D solutions (Wu and Yao, 2021; Zhao
et al., 2022a; Puche and Lee, 2022; Zuo et al., 2022)
and tackling irregular shapes (Crescitelli and Oshima,
2023; Fang et al., 2023b; Fang et al., 2022; Fang et al.,
2021; Yang et al., 2023).
4 DRL APPROACH FOR 2D+1
PACKING PROBLEM
This section describes our DRL solution for a 2D+1
packing environment, inspired by real-world scenar-
ios related to aerospace composite manufacturing.
The environment simulates the task of efficiently
packing rectangular pieces onto two distinct boards
with limited height. It was built using the OpenAI
Gymnasium framework and represents the packing
scenario with the following key components:
1. Observation Space. It consists of two matrices, each representing a board with length × width dimensions. Additionally, four integer values are included, corresponding to the quantities of the four different types of pieces.
2. Action Space. It comprises a multi-discrete
space, encompassing the (x, y) coordinates for the
top-left corner of a piece placement, an index se-
lecting the target board, and another index spec-
ifying the piece to be chosen from the available
set.
3. Algorithm. It is structured to reward the agent for
positive actions that effectively fill the available
spaces in the environment. Conversely, penalties
are applied for invalid actions, such as selecting a
piece with zero remaining quantity, attempting to
place a piece on an already occupied coordinate,
or putting a piece that exceeds the tooling cart’s
height. The process proceeds in Algorithm 1.
The agent and our simulator interact during each episode in a discrete-time sequence, t = 0, 1, 2, 3, ... At each time step t, the agent is provided with a representation of the boards and the quantities of pieces to be placed, s_t ∈ S, where S represents the set of available positions on the boards and the piece types. The action taken by the agent, denoted a_t ∈ A(s_t), consists of selecting the coordinates (x, y), the board index, and the piece index in state s_t for placement. At time step t + 1, as a result of this action, the agent receives a numerical reward r_{t+1} ∈ R and transitions to a new state s_{t+1}, as shown in Figure 1.
The R_height is determined by the following conditions:

    If (piece_height / board_height) × 100 ≤ 50,   then R_height = 0
    Else if (piece_height / board_height) × 100 ≤ 80,   then R_height = 1
    Else if (piece_height / board_height) × 100 ≤ 100,  then R_height = 2 (optimal)
    Else if (piece_height / board_height) × 100 > 100,  then R_height = -2
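These conditions translate directly into a small function; the sketch below is illustrative (the function name and arguments are assumptions, not the exact implementation), assuming the piece height is compared as a percentage of the board's height limit.

```python
def height_reward(piece_height: float, board_height: float) -> int:
    """Illustrative translation of the R_height conditions above."""
    ratio = (piece_height / board_height) * 100
    if ratio <= 50:
        return 0
    elif ratio <= 80:
        return 1
    elif ratio <= 100:
        return 2    # optimal: the piece uses most of the available height
    else:
        return -2   # piece exceeds the board/tooling cart height limit

# Example with the piece heights used in the experiments (75 cm and 115 cm)
# against a 120 cm board: 75/120 -> 62.5% (reward 1), 115/120 -> 95.8% (reward 2).
```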
Algorithm 1: Packing2D Environment - Step Function.

Data: Action a = [x, y, b, p], where:
    (x, y): placement coordinates on the board
    b: board index (0 or 1)
    p: piece type index (0 to 3)
Result: Observation s', Reward r, Done flag

B_0, B_1, BH_0, BH_1 : current states of boards and height maps
Q_0, Q_1, Q_2, Q_3 : remaining quantities of piece types 0, 1, 2, 3
empty <- count_zeros(B_0) + count_zeros(B_1)
if Q_0, Q_1, Q_2, Q_3 > 0 and empty > 0 then
    piece <- shape matrix of piece type p, with length l and width w
    x <- clip(x, 0, board_width - w)
    y <- clip(y, 0, board_length - l)
    occupied <- 0
    for i = 0 to l - 1 do
        for j = 0 to w - 1 do
            if piece[i][j] == 1 then
                if x + i < board_length and y + j < board_width then
                    B_b(x+i, y+j) <- B_b(x+i, y+j) + 1
                    BH_b(x+i, y+j) <- BH_b(x+i, y+j) + piece_height(p)
                    if B_b(x+i, y+j) > 1 then
                        occupied <- 1
                    end
                end
            end
        end
    end
    r_height <- check_height(b, piece_height(p))
    if occupied == 0 and r_height >= 0 then
        dim <- (w, l)   (dimensions of piece type p)
        r <- dim × r_height
    else
        r <- -8
        revert B_b and BH_b to previous state
    end
    done <- False
    Q_p <- Q_p - 1
else
    done <- True
    r <- (calculate_reward(B_0) + calculate_reward(B_1)) / 2
end
return (B_0, B_1, Q_0, Q_1, Q_2, Q_3), r, done
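The observation and action spaces described in items 1 and 2 can be expressed with Gymnasium space objects. The sketch below is a minimal illustration assuming the 8 × 8 board configuration and the largest per-type quantity from Table 2; it is not the exact declaration used in our simulator.

```python
# Sketch of how the observation and action spaces described above could be
# declared with Gymnasium; the concrete bounds are assumptions based on the
# 8x8 board configuration and Table 2, not the exact implementation.
import numpy as np
from gymnasium import spaces

BOARD_L, BOARD_W, N_BOARDS, N_PIECE_TYPES = 8, 8, 2, 4

observation_space = spaces.Dict({
    # Occupancy of each board cell (0 = empty, 1 = filled).
    "boards": spaces.Box(low=0, high=1, shape=(N_BOARDS, BOARD_L, BOARD_W), dtype=np.int8),
    # Remaining quantity of each of the four piece types (upper bound from Table 2).
    "quantities": spaces.Box(low=0, high=16, shape=(N_PIECE_TYPES,), dtype=np.int32),
})

# Multi-discrete action: (x, y) top-left coordinates, board index, piece index.
action_space = spaces.MultiDiscrete([BOARD_L, BOARD_W, N_BOARDS, N_PIECE_TYPES])

sample = action_space.sample()   # e.g., array([3, 5, 1, 2])
```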
4. Training and Testing. We employed only the PPO and A2C methods from the Stable Baselines library, because they support multi-discrete action spaces. PPO, a state-of-the-art model-free reinforcement learning algorithm (Sun et al., 2019), is particularly effective in this context. Standard implementations of Deep Q-Network (DQN) and Soft Actor-Critic (SAC) are not directly applicable to multi-discrete action spaces.

Figure 1: Our simulator pipeline based on reinforcement learning methods.
Our agents were trained in a 2D+1 packing en-
vironment for 10 million episodes. Evaluations
were conducted every 50 episodes under deter-
ministic conditions, and each experiment was re-
peated 10 times. For both PPO and A2C, we used
a linear learning rate of 0.0005 and a discount fac-
tor (gamma) of 0.95 as hyperparameters.
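A minimal training setup consistent with this description, using Stable-Baselines3, could look like the sketch below. The environment class Packing2DEnv is a placeholder for our simulator (so the snippet is not runnable as-is), and the episode-based budget and evaluation cadence reported above are mapped onto SB3's timestep-based arguments as an approximation.

```python
# Sketch of the training setup described above with Stable-Baselines3.
# "Packing2DEnv" is a placeholder name for the 2D+1 packing simulator; the
# hyperparameters mirror the values reported in the text.
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.callbacks import EvalCallback

def linear_schedule(initial_value):
    """Linearly decay the learning rate from initial_value to 0."""
    def schedule(progress_remaining):
        return progress_remaining * initial_value
    return schedule

train_env = Packing2DEnv()   # placeholder: our 2D+1 packing environment
eval_env = Packing2DEnv()

# Deterministic evaluation at a fixed cadence (the paper reports every 50 episodes;
# EvalCallback counts environment steps, so this is an approximation).
eval_callback = EvalCallback(eval_env, eval_freq=50, deterministic=True)

model = PPO(                 # swap PPO for A2C to reproduce the comparison
    "MultiInputPolicy",      # handles dict observations (boards + quantities)
    train_env,
    learning_rate=linear_schedule(0.0005),
    gamma=0.95,
    verbose=0,
)
model.learn(total_timesteps=10_000_000, callback=eval_callback)
model.save("ppo_packing2d")
```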
5 EXPERIMENTS
This section presents the experimental protocol and
results achieved by DRL methods.
5.1 Experimental Methodology
In this work, we conducted six different experiments:
three of them using even board dimensions (8×8) and
three using odd board dimensions (7 × 7). Each set of
experiments covered the following conditions: (1) the pieces and boards were constrained to a uniform height, (2) board 1 was taller than board 2, and (3) board 2 was taller than board 1. Table 1 summarizes the setup adopted in the experiments.
Table 1: Setup of the experiments.
Exp. ID Even/Odd Width Length Height 0 Height 1
1 Even 8 8 100 100
2 Even 8 8 120 80
3 Even 8 8 80 120
4 Odd 7 7 100 100
5 Odd 7 7 120 80
6 Odd 7 7 80 120
We also employed four types of pieces: 2 × 2 and
2 × 1, with heights of either 115 or 75 centimeters.
These pieces are placed within two boards that define
the environment’s boundaries. Figure 2 shows their
shapes.
Figure 2: The types of piece available to be placed into the
boards.
Table 2 shows the number of pieces used in each experiment.
Table 2: The sets of pieces used for each experiment.
Exp. ID Qty Piece 1 Qty Piece 2 Qty Piece 3 Qty Piece 4
1 8 8 16 16
2 8 8 16 16
3 8 8 16 16
4 6 6 9 9
5 6 6 9 9
6 6 6 9 9
In a real-world aerospace manufacturing setting,
the number of parts in an autoclave can vary signifi-
cantly based on available volume and batch size. We
can expect around 30 to 50 parts per curing cycle.
However, the number may be lower, such as when
dealing with aircraft fairings. It is important to note
that the packing phase does not involve irregular parts
due to the safety margins necessary to achieve the de-
sired product properties. Furthermore, since the process operates on batches of parts, we abstract these constraints and focus on the packing process.
In the experiments carried out, the PPO, A2C and
MaxRect-BL methods have been compared. We se-
lected the MaxRect-BL approach, aligning with the
bottom-left placement strategies employed in prior
work by (Zhao et al., 2022b) and (Fang et al., 2023a)
within the context of RL. To address the limitations of
MaxRect-BL in handling height constraints, we im-
plemented a modified version inspired by the BFDH.
This modified approach prioritizes ordering the pieces by height before considering their size during the packing process, effectively improving the packing efficiency. Furthermore, the simulations were performed on a Core i7 processor with 16 GB of RAM.
Each PPO training session lasted approximately 7
hours.
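For reference, the piece ordering used by the modified MaxRect-BL baseline can be expressed as a simple sort key. The sketch below is illustrative only; the piece dimensions and heights are inferred from Section 5.1 and the Experiment 2 and 3 descriptions, and the helper name is hypothetical.

```python
# Illustrative ordering for the modified MaxRect-BL baseline: sort by height
# first (descending in Experiment 2, ascending in Experiment 3), then by area,
# before applying the usual bottom-left placement.
pieces = [
    {"name": "Piece 1", "w": 2, "l": 2, "h": 115},  # dimensions/heights inferred
    {"name": "Piece 2", "w": 2, "l": 2, "h": 75},   # from Section 5.1 and the
    {"name": "Piece 3", "w": 2, "l": 1, "h": 115},  # Experiment 2/3 descriptions
    {"name": "Piece 4", "w": 2, "l": 1, "h": 75},
]

def order_pieces(pieces, height_descending=True):
    """Order pieces by height, then by area (both descending by default)."""
    return sorted(
        pieces,
        key=lambda p: (p["h"] if height_descending else -p["h"], p["w"] * p["l"]),
        reverse=True,
    )

print([p["name"] for p in order_pieces(pieces)])                            # Experiment 2
print([p["name"] for p in order_pieces(pieces, height_descending=False)])   # Experiment 3
```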
5.2 Even Experiments
Figure 3 illustrates the evaluation curves for 10 inde-
pendent PPO runs across the three experimental con-
ditions. These experiments demonstrated optimal per-
formance by achieving the maximum reward through
100% correct board fillings. Negative reward values
signify incorrect board configurations.
Figure 3: Evaluation curves of the three even experiments
using PPO.
Figure 4 presents the curves depicting the mean
episode length across 10 independent PPO runs un-
der the three experimental conditions. These re-
sults highlight the progression of the mean episode
length throughout iterations, providing insights into
the agent’s performance dynamics and its efficiency
in solving the problem. The optimal episode length
occurs approximately when the maximum number of
pieces is successfully packed onto a board.
Figure 4: Mean episode length over even experiments using
PPO.
In the experiments carried out, the A2C method proved to be less practical than PPO due to its significantly longer convergence time and greater instability during training. Furthermore, A2C often failed to achieve optimal performance compared to the PPO method. We selected 4 results from each experimental setup to compare these RL methods. Table 3 contains
the mean percentage of correct board fills (Mean) and
its standard deviation (Std).
Table 3: Comparative analysis between PPO and A2C methods for correct board filling.

Experiment | PPO Mean | PPO Std | A2C Mean | A2C Std
1 | 96.0% | 3.0% | 88.0% | 6.0%
2 | 96.0% | 5.0% | 34.0% | 23.0%
3 | 94.0% | 5.0% | 74.0% | 5.0%
5.2.1 Experiment 1
All pieces and boards were constrained to a uniform height in this experiment. Both the PPO and MaxRect-BL algorithms achieved complete coverage (100%) of the boards, as demonstrated in Figure 5.
The green regions highlight the optimal placements
determined by the algorithms during the packing pro-
cess.
Figure 5: Experiment 1 - All pieces and boards were constrained to a uniform height in this experiment.
5.2.2 Experiment 2
Both PPO and MaxRect-BL achieved 100% cover-
age. However, MaxRect-BL's optimal performance
was contingent on a specific piece sorting strategy:
first by descending height, then by descending dimen-
sions. The BFDH heuristic could also achieve optimal
performance. Figure 6 shows the convergence behav-
ior of the experiment.
Figure 6: Experiment 2 - board 1 taller than board 2.
Without sorting by height, the MaxRect-BL fails
to converge as effectively, as indicated in Figure 7.
This occurs because MaxRect-BL initially places 8
Pieces 1 (with R_height = 2) and 8 Pieces 2 (with R_height = 1), which fill all the available space on board 0. It then attempts to add 16 Pieces 3 (whose height exceeds the board's height, causing them to be skipped) and finally places 16 Pieces 4 (with R_height = 2) on board 1. The yellow areas highlight suboptimal placement choices resulting from R_height < 2, while the white areas represent unused space on the board.
Figure 7: Experiment 2 without piece ordering for MaxRect-BL.
5.2.3 Experiment 3
In this experiment, both PPO and MaxRect-BL
achieved 100% coverage. While MaxRect-BL re-
quired a specific piece sorting strategy (ascending
height, descending dimensions) for optimal perfor-
mance, the BFDH heuristic could also achieve opti-
mal results in this scenario. Figure 8 illustrates the
convergence behavior of the experiment.
Without sorting by height, the MaxRect-BL fails
to converge as effectively, as indicated in Figure 9.
Figure 8: Experiment 3 - board 2 taller than board 1.
This occurs because MaxRect-BL first attempts to
place 8 Pieces 1 (whose height exceeds the board’s
limit, causing them to be skipped), but successfully
adds 8 Pieces 2 (with R_height = 2). It then tries to add 16 Pieces 3 (again skipped due to their excessive height) and finally places 16 Pieces 4 (which meet the optimal condition of R_height = 2) on board 0.
Figure 9: Experiment 3 without piece ordering for MaxRect-BL.
5.3 Odd Boards
Figure 10 illustrates the evaluation curves for 10 inde-
pendent PPO runs across the three experimental con-
ditions. These experiments demonstrated optimal per-
formance by achieving the maximum reward through
100% correct board fillings. Negative reward values
signify incorrect board configurations.
Figure 11 presents the curves depicting the mean
episode length across 10 independent PPO runs un-
der the three experimental conditions. These re-
sults highlight the progression of the mean episode
length throughout iterations, providing insights into
the agent’s performance dynamics and its efficiency
in solving the problem. The optimal episode length
Figure 10: Evaluation curves of the three odd experiments
using PPO.
Figure 11: Mean episode length over odd experiments using
PPO.
occurs approximately when the maximum number of
pieces is successfully packed onto a board.
In these experiments, it was again possible to observe that the A2C method achieved inferior performance compared to the PPO method. We selected 4 results from each experimental condition to compare these RL methods. Table 4 contains the mean
percentage of correct board fills (Mean) and its stan-
dard deviation (Std).
Table 4: Comparative analysis between PPO and A2C methods for correct board fills.

Experiment | PPO Mean | PPO Std | A2C Mean | A2C Std
4 | 97.0% | 3.0% | 82.0% | 8.0%
5 | 97.0% | 3.0% | 88.0% | 4.0%
6 | 97.0% | 4.0% | 81.0% | 8.0%
5.3.1 Experiment 4
All pieces and boards were constrained to a uniform height in this experiment.
Figure 12: Experiment 4 - All pieces and boards were constrained to a uniform height in this experiment.
Both the PPO and MaxRect-BL algorithms
achieved complete coverage (100%) of the boards, as
demonstrated in Figure 12. The green regions high-
light the optimal placements determined by the algo-
rithms during the packing process.
5.3.2 Experiment 5
PPO and MaxRect-BL (ordered by height descending)
successfully placed all pieces. MaxRect-BL exhibits
the same limitations as in Experiment 2 without this
ordering. Figure 13 shows the convergence behavior
of the experiment.
Figure 13: Experiment 5 - board 1 taller than board 2.
5.3.3 Experiment 6
PPO and MaxRect-BL (ordered by height ascending)
successfully placed all pieces. MaxRect-BL exhibits
the same limitations as in Experiment 3 without this
ordering. Figure 14 shows the convergence behavior
of the experiment.
Figure 14: Experiment 6 - board 2 taller than board 1.
6 CONCLUSIONS
This paper proposed a 2D+1 simulator for a spatially constrained packing problem, developed within the OpenAI Gymnasium framework and targeting the offline packing setting. This simulator
employs an observation space comprising two boards
and four different types of pieces and their associated
quantities. It supports a multi-discrete action space, allowing the selection of a position on a specific board and the choice of a piece to place. Furthermore, this paper introduced a new spatial-variant reward function that maximizes coverage by considering both the dimensions and the height of the pieces.
This research conducted a literature review fo-
cused on deep reinforcement learning solutions for
the 2D regular packing problem. Since 2018, pub-
lications involving DRL for this type of problem have
attracted the attention of researchers; however, there
are still research gaps, such as the use of on-policy
actor-critic methods for the target task.
In the performed experiments, it was possible to
observe that PPO and MaxRect-BL (with height or-
dering) have correctly allocated all of the pieces.
However, MaxRect-BL without height ordering ex-
hibited poorer performance, as illustrated in Figures
7 and 9. As the problem complexity increases (e.g.,
multiple boards), the effectiveness of simple heuris-
tics like height-based ordering diminishes. While the
BFDH heuristic is viable for packing items, PPO’s
ability to learn and adapt dynamically through explo-
ration and exploitation provides a more flexible and
potentially superior solution. The A2C method did not show better results than PPO in the experiments.
As future work, to enhance the simulator’s fidelity
as a digital representation of an aerospace industry au-
toclave for composite material curing, we plan to im-
plement key improvements, including material alloca-
tion constraints to ensure accurate material placement
based on specific curing types, thus reflecting real-
world production processes. Additionally, we will
integrate thermocouple and pressure sensor simula-
tions to capture precise temperature and pressure con-
ditions within the autoclave, providing valuable data
for process optimization and quality control. Further-
more, a mechanism will be added to simulate material
delivery deadlines, ensuring the simulator reflects the
time-sensitive nature of production operations. These
enhancements will result in a more comprehensive
and realistic model of the autoclave curing process,
enabling engineers to conduct more effective simula-
tions and optimize production workflows.
ACKNOWLEDGEMENTS
The authors would like to thank the National Coun-
cil for Scientific and Technological Development
(CNPq) for granting a scholarship to Victor Pugliese
through the Academic Master’s and Doctorate Pro-
gram in Innovation (MAI/DAI) in collaboration with
the EMBRAER S.A. company.
REFERENCES
Alemão, D., Rocha, A. D., and Barata, J. (2021). Smart manufacturing scheduling approaches—systematic review and future directions. Applied Sciences, 11(5):2186.
Ao, W., Zhang, G., Li, Y., and Jin, D. (2023). Learning to
solve grouped 2d bin packing problems in the manu-
facturing industry. In Proceedings of the 29th ACM
SIGKDD Conference on Knowledge Discovery and
Data Mining, pages 3713–3723.
Azami, A. (2016). Scheduling Hybrid Flow Lines of
Aerospace Composite Manufacturing Systems. PhD
thesis, Concordia University.
Azami, A., Demirli, K., and Bhuiyan, N. (2018). Schedul-
ing in aerospace composite manufacturing systems: a
two-stage hybrid flow shop problem. The Interna-
tional Journal of Advanced Manufacturing Technol-
ogy, 95:3259–3274.
Cheng, C.-A., Kolobov, A., and Swaminathan, A. (2021).
Heuristic-guided reinforcement learning. Advances
in Neural Information Processing Systems, 34:13550–
13563.
Collart, A. (2015). An application of mathematical op-
timization to autoclave packing and scheduling in a
composites manufacturing facility.
Crescitelli, V. and Oshima, T. (2023). A deep reinforce-
ment learning method for 2d irregular packing with
dense reward. In 2023 Fifth International Confer-
ence on Transdisciplinary AI (TransAI), pages 270–
271. IEEE.
Elkington, M., Bloom, D., Ward, C., Chatzimichali, A.,
and Potter, K. (2015). Hand layup: understanding the
manual process. Advanced manufacturing: polymer
& composites science, 1(3):138–151.
Fang, J., Rao, Y., Ding, W., and Meng, R. (2022). Research
on two-dimensional intelligent nesting based on sarsa-
learning. In 2022 5th International Conference on Ad-
vanced Electronic Materials, Computers and Software
Engineering (AEMCSE), pages 826–829. IEEE.
Fang, J., Rao, Y., Guo, X., and Zhao, X. (2021). A rein-
forcement learning algorithm for two-dimensional ir-
regular packing problems. In Proceedings of the 2021
4th International Conference on Algorithms, Comput-
ing and Artificial Intelligence, pages 1–6.
Fang, J., Rao, Y., and Shi, M. (2023a). A deep reinforce-
ment learning algorithm for the rectangular strip pack-
ing problem. Plos one, 18(3):e0282598.
Fang, J., Rao, Y., Zhao, X., and Du, B. (2023b). A hy-
brid reinforcement learning algorithm for 2d irregular
packing problems. Mathematics, 11(2):327.
Haskilic, V., Ulucan, A., Atici, K. B., and Sarac, S. B.
(2023). A real-world case of autoclave loading and
scheduling problems in aerospace composite material
production. Omega, page 102918.
Huang, S., Kanervisto, A., Raffin, A., Wang, W., Ontañón, S., and Dossa, R. F. J. (2022). A2C is a special case of PPO. arXiv preprint arXiv:2205.09123.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996).
Reinforcement learning: A survey. Journal of artifi-
cial intelligence research, 4:237–285.
Keras, F. (2022). PPO proximal policy optimization.
Kundu, O., Dutta, S., and Kumar, S. (2019). Deep-pack:
A vision-based 2d online bin packing algorithm with
deep reinforcement learning. In 2019 28th IEEE Inter-
national Conference on Robot and Human Interactive
Communication (RO-MAN), pages 1–7. IEEE.
Li, D., Gu, Z., Wang, Y., Ren, C., and Lau, F. C. (2022).
One model packs thousands of items with recurrent
conditional query learning. Knowledge-Based Sys-
tems, 235:107683.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T.,
Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous methods for deep reinforcement learn-
ing. In International conference on machine learning,
pages 1928–1937. PMLR.
Oliveira, J. F., Neuenfeldt, A., Silva, E., and Carrav-
illa, M. A. (2016). A survey on heuristics for the
two-dimensional rectangular strip packing problem.
Pesquisa Operacional, 36(2):197–226.
Puche, A. V. and Lee, S. (2022). Online 3d bin packing
reinforcement learning solution with buffer. In 2022
ieee/rsj international conference on intelligent robots
and systems (iros), pages 8902–8909. IEEE.
Ramezankhani, M., Crawford, B., Narayan, A., Voggenre-
iter, H., Seethaler, R., and Milani, A. S. (2021). Mak-
ing costly manufacturing smart with transfer learning
under limited data: A case study on composites auto-
clave processing. Journal of Manufacturing Systems,
59:345–354.
Sáenz Imbacuán, R. (2020). Evaluating the impact of curriculum learning on the training process for an intelligent agent in a video game.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. arXiv preprint arXiv:1707.06347.
Seizinger, M. (2018). The two dimensional bin packing problem with side constraints. In Operations Research Proceedings 2017: Selected Papers of the Annual International Conference of the German Operations Research Society (GOR), Freie Universität Berlin, Germany, September 6-8, 2017, pages 45–50. Springer.
Sun, Y., Yuan, X., Liu, W., and Sun, C. (2019). Model-
based reinforcement learning via proximal policy op-
timization. In 2019 Chinese Automation Congress
(CAC), pages 4736–4740. IEEE.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Wang, X., Thomas, J. D., Piechocki, R. J., Kapoor, S., Santos-Rodríguez, R., and Parekh, A. (2022). Self-play learning strategies for resource assignment in open-ran networks. Computer Networks, 206:108682.
Wu, Y. and Yao, L. (2021). Research on the problem of 3d
bin packing under incomplete information based on
deep reinforcement learning. In 2021 International
conference on e-commerce and e-management (ICE-
CEM), pages 38–42. IEEE.
Xia, K., Sacco, C., Kirkpatrick, M., Saidy, C., Nguyen,
L., Kircaliali, A., and Harik, R. (2021). A digital
twin to train deep reinforcement learning agent for
smart manufacturing plants: Environment, interfaces
and intelligence. Journal of Manufacturing Systems,
58:210–230.
Xie, N., Zheng, S., and Wu, Q. (2020). Two-dimensional
packing algorithm for autoclave molding scheduling
of aeronautical composite materials production. Com-
puters & Industrial Engineering, 146:106599.
Yang, Z., Pan, Z., Li, M., Wu, K., and Gao, X. (2023).
Learning based 2d irregular shape packing. ACM
Transactions on Graphics (TOG), 42(6):1–16.
Zhang, Y., Bai, R., Qu, R., Tu, C., and Jin, J. (2022). A deep
reinforcement learning based hyper-heuristic for com-
binatorial optimisation with uncertainties. European
Journal of Operational Research, 300(2):418–427.
Zhao, H., Zhu, C., Xu, X., Huang, H., and Xu, K. (2022a).
Learning practically feasible policies for online 3d
bin packing. Science China Information Sciences,
65(1):112105.
Zhao, X., Rao, Y., and Fang, J. (2022b). A reinforcement
learning algorithm for the 2d-rectangular strip packing
problem. In Journal of Physics: Conference Series,
volume 2181, page 012002. IOP Publishing.
Zhu, K., Ji, N., and Li, X. D. (2020). Hybrid heuristic
algorithm based on improved rules & reinforcement
learning for 2d strip packing problem. IEEE Access,
8:226784–226796.
Zuo, Q., Liu, X., Xu, L., Xiao, L., Xu, C., Liu, J., and Chan,
W. K. V. (2022). The three-dimensional bin pack-
ing problem for deformable items. In 2022 IEEE In-
ternational Conference on Industrial Engineering and
Engineering Management (IEEM), pages 0911–0918.
IEEE.