
of the parts. Additionally, the height of each part must
be verified to ensure it does not exceed the capacity of
the tooling cart.
The introduction of Reinforcement Learning (RL)
methods to solve packing problems has shown
promising results in the literature. For instance,
(Kundu et al., 2019) employed RL to take an image
as input and predict the pixel position of the next
box, while (Li et al., 2022) explored RL in 2D and
3D environments. Furthermore, combining heuristics
with RL, as in (Fang et al., 2023a), has proven to
be effective, and RL has also been applied to several other types of problems, as discussed in (Wang
et al., 2022). One of the advantages of RL is that
it does not require an explicit model of the environ-
ment; the agent learns to make decisions by observ-
ing the rewards of its actions from a state, as described
in (Sutton and Barto, 2018), and continuously adapts
to its environment through exploration and exploita-
tion. This makes RL particularly suitable for sequen-
tial decision-making in games, robotics, control sys-
tems, and scheduling problems (Cheng et al., 2021).
Our approach distinguishes itself by relying solely
on RL methods, using actor-critic to explore and ex-
ploit. This contrasts with other packing studies that
frequently incorporate heuristics to guide or direct the
RL algorithm, thereby limiting its scope and creativ-
ity. To our knowledge, no prior study in the literature has addressed this specific problem. Therefore, this
paper aims to apply Reinforcement Learning meth-
ods to address a 2D+1 packing problem with spatial
constraints. This problem is an extension of the tra-
ditional 2D packing problem, incorporating an addi-
tional constraint on the height dimension. We also
compare PPO and A2C, the only methods considered that support multi-discrete action spaces. This research,
inspired by the challenges of aerospace composite
manufacturing, has potential applications in many in-
dustry sectors, including the packing of components
in vehicles, organizing parts in boxes or pallets for
transport and storage, arranging products in store displays, and similar optimization tasks across different
sectors.
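As a rough sketch of how such a comparison can be set up, the snippet below assumes the Stable-Baselines3 implementations of PPO and A2C together with a hypothetical toy environment whose multi-discrete action space mimics a discretised placement decision; the environment, its spaces, and its reward are illustrative placeholders, not the environment developed in this paper.

import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import A2C, PPO   # assumed library; both handle MultiDiscrete actions

class ToyPackingEnv(gym.Env):
    # Hypothetical placeholder: an action is (discretised x position, y position, rotation flag).
    def __init__(self):
        super().__init__()
        self.action_space = spaces.MultiDiscrete([10, 10, 2])
        self.observation_space = spaces.Box(0.0, 1.0, shape=(4,), dtype=np.float32)
        self._steps = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._steps = 0
        return np.zeros(4, dtype=np.float32), {}

    def step(self, action):
        self._steps += 1
        reward = 0.0                      # a real reward would score the resulting placement
        terminated = self._steps >= 20
        return np.zeros(4, dtype=np.float32), reward, terminated, False, {}

for Algo in (PPO, A2C):                  # same multi-discrete environment, two on-policy methods
    model = Algo("MlpPolicy", ToyPackingEnv(), verbose=0)
    model.learn(total_timesteps=1_000)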
2 BACKGROUND
This section briefly describes the types of packing problems and the deep reinforcement learning (DRL)
methods used in this paper.
2.1 Packing
The packing problem is a classic challenge in combi-
natorial optimization that has been extensively stud-
ied for decades by researchers in operations research
and computer science, as noted in (Li et al., 2022).
The primary objective is to allocate objects within containers while minimizing wasted space. The problem can involve regular shapes (Kundu et al., 2019; Zhao et al., 2022b) or irregular shapes (Crescitelli and Oshima, 2023), and is often studied in streaming/online or batching/offline settings.
Several works based on heuristic approaches have
been proposed for solving packing problems as de-
scribed in (Oliveira et al., 2016), such as the Max-
imum Rectangles - Bottom-Left (Max Rects-BL),
Best-Fit Decreasing Height (BFDH), and Next-Fit
Decreasing Height (NFDH). The Max Rects-BL approach
places the largest rectangle in the nearest available
bottom-left corner of a 2D space (Fang et al., 2023a).
BFDH sorts items by descending height and then at-
tempts to place each item, left-justified, on the ex-
isting level with the minimum remaining horizon-
tal space (Seizinger, 2018). The NFDH approach first arranges the pieces in descending order of height, then places each piece on the current level, starting from the left side, as long as there is enough space; otherwise, it starts a new level (Oliveira et al., 2016).
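To make this level-based procedure concrete, a minimal Python sketch of NFDH is given below; the function name nfdh, the strip_width parameter, and the sample rectangles are our own illustrative choices rather than notation from the cited works.

def nfdh(rectangles, strip_width):
    # Next-Fit Decreasing Height: place (width, height) rectangles on a strip
    # of fixed width and return the (x, y) position assigned to each piece.
    order = sorted(range(len(rectangles)), key=lambda i: -rectangles[i][1])
    positions = [None] * len(rectangles)
    level_y = 0.0        # bottom of the current level
    level_height = 0.0   # height of the tallest piece placed on the current level
    cursor_x = 0.0       # next free x position on the current level
    for i in order:
        w, h = rectangles[i]
        if cursor_x + w > strip_width:   # no room left on this level:
            level_y += level_height      # close it and open a new level above
            level_height = 0.0
            cursor_x = 0.0
        positions[i] = (cursor_x, level_y)
        cursor_x += w
        level_height = max(level_height, h)
    return positions

# Example: four pieces on a strip of width 5.
print(nfdh([(3, 2), (2, 4), (4, 3), (2, 2)], strip_width=5))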
2.2 Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) addresses the
challenge of autonomously learning optimal decisions
over time. Although it employs well-established su-
pervised learning methods, such as deep neural net-
works for function approximation, stochastic gradi-
ent descent (SGD), and backpropagation, RL applies
these techniques differently, without a supervisor, us-
ing a reward signal and delayed feedback. In this
context, an RL agent receives dynamic states from an
environment and takes actions to maximize rewards
through trial-and-error interactions (Kaelbling et al.,
1996).
The agent and the environment interact in a sequence at each discrete time step, t = 0, 1, 2, 3, .... At each time step t, the agent receives a representation of the environment's state s_t ∈ S, where S is the set of possible states, and selects an action a_t ∈ A(s_t), where A(s_t) is the set of actions available in state s_t. At time step t + 1, as a consequence of its action, the agent receives a numerical reward r_{t+1} ∈ R and transitions to a new state s_{t+1} (Sutton and Barto, 2018).
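A minimal sketch of this interaction loop, written against the Gymnasium API purely for illustration, is shown below; CartPole-v1 stands in for any environment exposing states, actions, and rewards in this form, and a random policy takes the place of a learned one.

import gymnasium as gym

env = gym.make("CartPole-v1")            # stand-in for any environment with this interface
state, _ = env.reset(seed=0)             # observe the initial state s_0
for t in range(200):
    action = env.action_space.sample()   # a trained agent would instead sample a_t from its policy
    state, reward, terminated, truncated, _ = env.step(action)  # receive r_{t+1} and s_{t+1}
    if terminated or truncated:
        state, _ = env.reset()
env.close()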
During each iteration, the agent implements a mapping from states to the probabilities of selecting each possible action; this mapping is called the agent's policy.