A Framework for Studying Communication Pathways in Machine
Learning-Based Agent-to-Agent Communication
Sathish Purushothaman, Michael Granitzer, Florian Lemmerich and Jelena Mitrovic
University of Passau, Germany
{sathish.purushothaman, michael.granitzer, florian.lemmerich, jelena.mitrovic}@uni-passau.de
Keywords:
Multi-Agent Learning, Deep Neural Networks, Language Emergence.
Abstract:
The rise of Large Language Models (LLMs) has increased the relevance of agent-to-agent communication, particularly in systems where agents learn from their interactions. However, current LLMs offer limited insight into the communication dynamics among multiple agents, especially in large-scale settings. In particular, training LLMs, in contrast to in-context learning, becomes nearly infeasible without large-scale computing infrastructure. In our work, we present a machine learning-based agent framework for investigating the role of different communication pathways in language emergence between machine learning-based agents. We designed a transformer-based image auto-encoder as the agent architecture, in which a Gumbel SoftMax layer encodes images as sequences of symbols that form the language between agents. We study two pathways: in the first, the sender reads an image and sends a message to the receiver, who uses the message to reconstruct the sender's image; in the second, both sender and receiver read an image and minimize the distance between the generated symbols. In the first pathway, a language emerges, with a Levenshtein distance of at most 2 for 96% of messages. In the second pathway, no language emerges, with only 3% of messages within a Levenshtein distance of 2.
1 INTRODUCTION
In the deep learning era, the study of emergent com-
munication (Lazaridou and Baroni, 2020) explores
how intelligent, machine learning-based agents in-
teract to address specific problems. Large language
models (LLMs) have demonstrated a form of agency
(Wu et al., 2023) and the ability to learn from one
another (Dubois et al., 2023). However, findings in this area tend to be anecdotal or algorithmically derived rather than conceptual. Moreover, the expansive parameter space of LLMs constrains training updates, often relegating learning from communication to mere in-context adaptation. Additionally, relying on a single model restricts insight to the various, possibly interdependent, contexts of that one model, in contrast to engaging multiple independent models with distinct separation points. This limitation is particularly salient in studying phenomena such as the emergence of languages or the transfer of knowledge between agents through language. Just as LLMs may not always convey their "thoughts" explicitly (Turpin et al., 2023), it is plausible that a singular, albeit very large, model does
not fully capture the emergent communication phe-
nomena. Thus, while LLMs exhibit the capability for
agency and inter-agent communication, they do not
constitute an ideal platform for systematically inves-
tigating agent-to-agent communication among a large
number of agents.
Therefore, we propose a small and scalable multi-
agent communication framework to study commu-
nication pathways and structures among multiple
agents. The framework consists of multiple agents,
communication pathways, and a world model defin-
ing perceptions and associations of images and asso-
ciated labels. Successful communication emerges
between two agents when a receiver agent can un-
derstand the sender agent’s message in the way the
sender intended based on the sender’s perception. A
message consists of a sequence of symbols used by
agents for communication. In our problem, we have
a set of images, their corresponding semantics, and
symbols learned by agents through communication.
In particular, two agents successfully communicate
with each other when a receiver agent can reconstruct
the image (or the semantics of the image) seen by
the sender through the sender’s message only. If both
agents can reconstruct the images using each other’s
symbols, we can conclude that a shared language between the agents has emerged.
Figure 1: An agent is a transformer-based encoder-decoder architecture. The Gumbel SoftMax layer generates the symbols
of the language. Each agent can share an object and a message with another agent. The communication can be among many
agents. The communication framework allows us to study agent efficiency, convergence, and similarities under different
communication pathways.
We do not fix the symbolic representation but let each agent learn it in an auto-encoder-like setting.
We study two communication pathways given the
above setup. In the first communication pathway,
the agents pay attention to the receiver’s intention in-
stead of the symbol similarity. The receiver agent’s
objective is to reconstruct an image using only the
sender’s symbols. In the second communication path-
way, agents pay attention only to the similarity of
the symbols and not to the receiver’s intention. Each
agent’s objective is to develop a shared language rep-
resentation.
Given this setup, we conduct experiments to an-
swer the following research questions.
1. In the first communication pathway, does the suc-
cessful reconstruction of images based on sym-
bols shared between two agents lead to the emer-
gence of language?
2. In the second communication pathway, does the
sharing of symbols between two agents lead to the
emergence of language?
In our work, we create a transformer-based
multi-agent communication framework where mul-
tiple agents can communicate through symbols, as
shown in Figure 1. We designed the agents to represent what they see in the world as a sequence of symbols (e.g., image to text) and to represent what they hear as objects in the world (e.g., text to image). Symbols impose a discrete bottleneck on communication. Agents reuse the same symbols to represent an unbounded number of combinations of objects in the world.
The agents communicate with each other through a
symbolic layer implemented using Gumbel SoftMax
(Jang et al., 2017). The objective function minimizes the mean squared error between the two agents' messages and the binary cross entropy between the original and reconstructed images. We provide a world
model which validates the performance of each agent.
Our results indicate that agents can develop a
shared language within a Levenshtein distance of
2. The receiver agent can reconstruct the image using
the symbols communicated by the sender agent. The
structural similarity index measure (SSIM) between
the sender’s perceived image and the receiver’s recon-
structed image is 0.60. Our code is available in a public GitHub repository at https://github.com/sathishpurushothamanin/transformer-based-agent-communication-framework.
2 RELATED WORK
The study of emergent language using deep learn-
ing in a multi-agent setup can be subdivided into re-
inforcement learning and iterative learning (Lazari-
dou and Baroni, 2020; Brandizzi, 2023). In a rein-
forcement learning setup, multiple agents communi-
cate with each other in a referential game or a recon-
struction game (Batali, 1998; Kottur et al., 2017). In a
referential game, a sender agent produces a message
that refers uniquely to a specific object. The receiver
agent gets the sender’s message and a mixture of ob-
jects containing the sender’s perceived image.
The receiver agent should refer to the correct ob-
ject pointed by another agent among a group of dis-
similar objects. Both sender and receiver agents get
rewarded when the receiver agent can correctly iden-
tify the object pointed to by the sender.
In a reconstruction game, a sender agent produces
a message that refers uniquely to a specific object.
The receiver agent has to reconstruct the specific ob-
ject using the sender’s message. Both agents get re-
warded when there is a successful reconstruction of
the object.
In iterative learning (Rita et al., 2022; Li and Bowling, 2019; Ren et al., 2020), a student agent learns from a teacher agent, or a population of agents passes knowledge on from one generation to the next.
Recently, a growing number of frameworks have been proposed to study multi-agent communication (Jinxin et al., 2023; Wu et al., 2023; Park et al., 2023; Li et al., 2023).
To study compositionality, (Chaabouni et al., 2019) and (Evtimova et al., 2018) used LSTM network architectures with and without attention, and (Słowik et al., 2020) used a graph-based deep neural network. More recently, (Ri et al., 2023) investigated the role of attention-based deep neural networks.
3 TRANSFORMER-BASED
MULTI-AGENT
COMMUNICATION
FRAMEWORK
Humans use natural language to describe the world
using various word combinations with limited vocab-
ulary. As shown in Figure 1, we model the problem as
multiple agents interacting with each other without a
human in the loop. According to the information bottleneck principle (Tishby et al., 2000), a communication channel with limited capacity forces agents to convey only the most essential information. Each agent reuses the symbols in the vocabulary to refer to novel objects.
3.1 Communication Pathways
The communication framework presents two key
challenges corresponding to different information
flows between the agents. The first challenge is the
ability of the receiver agent to recreate the image seen
only by the sending agent. The second challenge is
how the agents can develop a shared language effec-
tively.
Figure 2: Agent 1 shares its representation of the world with
Agent 2 using a symbolic message. Agent 2 decodes the
message to generate its perception of the world. The first
pathway allows us to study if two agents develop a shared
language when each agent recreates an image seen by the
other agent.
Figure 3: Both agents share their symbolic representation
of the world. The second pathway allows us to study if two
agents sharing symbols develop a shared language.
3.1.1 Communication Pathway 1: Semantic
Exchange
As shown in Figure 2, the sender agent can use the
encoder to convert the images into a sequence of sym-
bols and send them to the receiver. The challenge for
the sender is to convey only relevant semantic infor-
mation because of the limited discrete symbolic mes-
sage. The receiver can use the symbols to reconstruct
the images. The challenge for the receiver is to use
only the relevant semantic information to reconstruct
the image. Successful communication emerges when
the receiver can reconstruct the original image with-
out losing the semantic information.
3.1.2 Communication Pathway 2: Shared
Language
As shown in Figure 3, each agent perceives the im-
ages and generates a sequence of symbols. For in-
stance, if two humans look at an image of the digit '9', they may use the symbols '9', 'nine', or 'IX'. The
agents may develop a shared language that may look
dissimilar to our natural language. Successful com-
munication emerges when both agents generate simi-
lar symbolic representations.
3.2 Agent Architecture
Our hypothesis is that languages evolved by paying
attention to various visual features. We introduce a
Transformer-based agent with the skills to generate text from visual stimuli and to generate images from text stimuli. With these skills, agents can exchange messages composed of symbols grounded in the stimuli they perceive.
As shown in Figure 1, we designed the agent as
transformer-based encoder-decoder architecture with
a discrete latent space.
3.2.1 Transformer Encoder
The encoder transforms the given image into a se-
quence of symbols. Thus, the agent can speak about
what it sees in the world.
Unless we state otherwise, all variables $X$, $W$, etc., in the encoding process are real-valued, that is, $X, W \in \mathbb{R}$. The encoder transforms the features from image to symbols as follows.
We transform the image into a sequence of pixels and add a learned positional embedding to the sequence of pixels:

$X_{BHW} = X_{BHW} + \mathrm{PositionalEmbedding}$

We linearly transform the features into a fixed hidden dimension of size $M$:

$X_{BM} = X_{BHW} W$

We apply dropout to improve learning (optional):

$X_{BM} = \mathrm{Dropout}(X_{BM}, 0.5)$

We apply multi-headed attention:

$X_{BM} = \mathrm{MultiHeadedAttention}(X_{BM})$

Finally, we apply layer normalization and a linear transformation to generate the symbolic message:

$X_{BM} = \mathrm{LayerNorm}(X_{BM}, 0.5)$

$X_{BLC} = X_{BM} W$
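The following is a minimal PyTorch sketch of this encoding path. The class name AgentEncoder, the hidden dimension of 256, and the treatment of the hidden vector as a length-1 attention sequence are our assumptions for illustration; the authors' implementation may differ.

```python
import torch
import torch.nn as nn

class AgentEncoder(nn.Module):
    """Sketch: flattened image pixels -> logits for L*C symbol slots."""

    def __init__(self, height=28, width=28, hidden_dim=256,
                 latent_dim=12, categorical_dim=2, n_heads=8, n_blocks=6):
        super().__init__()
        # Learned positional embedding added to the pixel sequence.
        self.pos_emb = nn.Parameter(torch.zeros(1, height * width))
        # Linear projection into the fixed hidden dimension M.
        self.to_hidden = nn.Linear(height * width, hidden_dim)
        self.dropout = nn.Dropout(0.5)  # optional, as in the text
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=n_heads,
                                           batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.norm = nn.LayerNorm(hidden_dim)
        # Final linear map to the symbolic message logits.
        self.to_symbols = nn.Linear(hidden_dim, latent_dim * categorical_dim)

    def forward(self, x):  # x: (batch, height*width), pixel values in [0, 1]
        x = x + self.pos_emb
        x = self.dropout(self.to_hidden(x))            # X_BM
        x = self.attention(x.unsqueeze(1)).squeeze(1)  # attention over a length-1 sequence
        x = self.norm(x)
        return self.to_symbols(x)                      # X_BLC logits
```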
3.2.2 Transformer Decoder
The decoder transforms a sequence of symbols into a
specific object. That is, the agent can generate images
using the symbols.
We add a learned positional embedding to the sequence of symbols:

$X_{BLC} = X_{BLC} + \mathrm{PositionalEmbedding}$

We transform the message into a fixed feature dimension of size $M$:

$X_{BM} = X_{BLC} W$

We apply multi-headed attention:

$X_{BM} = \mathrm{MultiHeadedAttention}(X_{BM})$

We apply layer normalization and a linear transformation to the features. Finally, we use the sigmoid function to generate the image:

$X_{BM} = \mathrm{LayerNorm}(X_{BM}, 0.5)$

$X_{BHW} = X_{BM} W$

$X_{BHW} = \mathrm{sigmoid}(X_{BHW})$
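A matching decoder sketch under the same assumptions; names and layer sizes are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn

class AgentDecoder(nn.Module):
    """Sketch: one-hot symbol message -> reconstructed image in [0, 1]."""

    def __init__(self, height=28, width=28, hidden_dim=256,
                 latent_dim=12, categorical_dim=2, n_heads=8, n_blocks=6):
        super().__init__()
        # Learned positional embedding added to the symbol sequence.
        self.pos_emb = nn.Parameter(torch.zeros(1, latent_dim * categorical_dim))
        # Linear projection of the message into the fixed feature dimension M.
        self.to_hidden = nn.Linear(latent_dim * categorical_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=n_heads,
                                           batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.norm = nn.LayerNorm(hidden_dim)
        self.to_image = nn.Linear(hidden_dim, height * width)

    def forward(self, s):  # s: (batch, latent_dim * categorical_dim) symbols
        s = s + self.pos_emb
        x = self.to_hidden(s)                          # X_BM
        x = self.attention(x.unsqueeze(1)).squeeze(1)
        x = self.norm(x)
        return torch.sigmoid(self.to_image(x))         # X_BHW, pixel intensities
```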
3.2.3 Gumbel SoftMax
We used the Gumbel SoftMax distribution to overcome the challenge of backpropagating through discrete latent dimensions.
Traditional variational auto-encoders learn the
data distribution (mean, variance) through the re-
parameterization trick. During training, we draw samples using the learned distribution parameters (mean, variance) and Gaussian random noise.
In our problem, we want to backpropagate through a
set of stochastic neurons, representing the hidden rep-
resentation. Ideally, we want the hidden representa-
tion to express a symbolic message. For this purpose, we use the Gumbel-SoftMax distribution, also known as the Concrete distribution. During training, the temperature of the Gumbel-SoftMax distribution is annealed from 1 to close to 0. Slow annealing allows
the distribution to collapse to a categorical distribu-
tion gradually. We predefined K categories, that is, the
symbol sequence length. Thus, we learn the symbols
using the Gumbel SoftMax trick that enables indirect
differentiation through the discrete space.
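A small sketch of this mechanism using PyTorch's built-in F.gumbel_softmax, with the temperature settings described in Section 4.1; the logits shape is a hypothetical example.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 100 messages: 12 symbol slots, 2 symbols each.
logits = torch.randn(100, 12, 2, requires_grad=True)

tau = 1.0  # initial temperature, as in Section 4.1
for iteration in range(3):
    # hard=False keeps soft, differentiable samples during training.
    soft_symbols = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
    tau = max(tau - 0.00003, 1e-2)  # slow annealing toward a categorical distribution

# hard=True returns discrete one-hot symbols at test time (straight-through trick).
message = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
print(message.shape)  # torch.Size([100, 12, 2])
```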
3.3 Training Procedure
1. Get a batch of images, say X.
2. Generate a batch of symbols $S_i$ using agent $i$'s encoder: $S_i = \mathrm{encoder}_i(X)$.
3. Reconstruct the images using the agent's decoder: $\tilde{X} = \mathrm{decoder}_i(S_i)$.
4. Calculate the reconstruction loss as a binary cross-entropy loss between the original and the reconstructed images: $L_{\mathrm{BCE}} = \text{binary-cross-entropy}(\tilde{X}, X)$.
5. Calculate the shared language loss as the mean squared error between both agents' symbols: $L_{\mathrm{MSE}} = \text{mean-squared-error}(S_1, S_2)$.
6. As shown in Figure 4, we calculate the combined loss by adding the reconstruction loss and the shared language loss: $L_{\mathrm{combined}} = L_{\mathrm{MSE}} + L_{\mathrm{BCE}}$.
7. Repeat the above steps for a fixed number of iterations or until the combined loss falls below a predefined threshold.
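A minimal sketch of one training step for the combined objective is given below. The agent1.encoder / agent1.decoder attribute names are assumptions, and the encoders are assumed to emit Gumbel SoftMax samples in [0, 1]; this is an illustration of the loss composition, not the authors' exact training loop.

```python
import torch
import torch.nn.functional as F

def training_step(agent1, agent2, images, optimizer):
    """One update of the combined objective L_combined = L_MSE + L_BCE."""
    s1 = agent1.encoder(images)   # symbols S_1 (assumed Gumbel SoftMax samples)
    s2 = agent2.encoder(images)   # symbols S_2
    x1 = agent1.decoder(s1)       # each agent reconstructs from its own symbols
    x2 = agent2.decoder(s2)

    # Reconstruction loss: binary cross entropy vs. the original images in [0, 1].
    l_bce = F.binary_cross_entropy(x1, images) + F.binary_cross_entropy(x2, images)
    # Shared language loss: mean squared error between the two agents' symbols.
    l_mse = F.mse_loss(s1, s2)

    loss = l_bce + l_mse          # combined loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```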
Figure 4: Each agent minimizes the binary cross entropy
loss between the recreated and the original image. For
shared language to emerge, we minimize the mean squared
error between the two agents' languages.
3.4 World Model
To evaluate the reconstructed images of the agents,
we created a convolution-based classifier that acts as
a World Model. Detailed architecture of the World
Model is shown in Figure 5. The World Model receives a batch of images as input, passes them through the convolutional neural network, and outputs the classification of the digits.
The World Model classifies the agents' reconstructed images. If the World Model can classify an agent-reconstructed image correctly, we know that the agent can communicate meaningful semantic information through its symbols. Because we train the World Model independently, we obtain an unbiased performance estimate of the agent.
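A sketch of such a classifier is given below; the exact layer sizes are assumptions loosely based on the convolutional encoder in Table 7, not the authors' World Model.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Sketch of the World Model: a small CNN digit classifier."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(),   # 28x28 -> 26x26
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),  # 26x26 -> 24x24
            nn.MaxPool2d(2), nn.Dropout(0.25),            # 24x24 -> 12x12
            nn.Flatten(),                                 # 64 * 12 * 12 = 9216
        )
        self.classifier = nn.Sequential(
            nn.Linear(9216, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):  # x: (batch, 1, 28, 28) reconstructed images
        return self.classifier(self.features(x))  # logits over the ten digits
```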
4 EXPERIMENTS
In all our experiments, we restrict the number of sym-
bols allowed to 2 (i.e., 0 or 1) and the communication
sequence length to 12. Thus, we use the terms bits
and symbols interchangeably.
4.1 General Configuration Settings
Following reproducibility guidelines, we fix the random seeds of PyTorch, cuDNN, and NumPy. We initialize the Gumbel SoftMax function with a temperature of 1 and gradually decrease the temperature at a rate of 0.00003 per iteration. We turn off the categorical representation by setting hard to False during training and turn it on by setting hard to True during testing. In the transformer architecture, we fixed the number of multi-headed attention blocks to 6 and the number of heads to 8. We fixed the training and test batch sizes to 100 and repeat the training procedure for 200 epochs.
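A minimal sketch of such a reproducibility setup in PyTorch; the helper name and the seed value are arbitrary assumptions.

```python
import random
import numpy as np
import torch

def fix_seeds(seed=42):
    """Fix random seeds across libraries for reproducible experiments."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
```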
Figure 5: World Model architecture is an MNIST convolu-
tional neural network classifier.
4.2 Evaluation Metrics
The agents communicate with each other by sharing
generated images and symbols. To assess the agent-reconstructed images, we use the F1 metric (Davis and Goadrich, 2006) and the structural similarity index measure (Wang et al., 2003). We evaluate the shared artificial language using the Levenshtein distance.
F1 Metric. F1 Score in Equation 1 is the harmonic
mean of precision and recall.
$\text{f1-score} = 2 \cdot \dfrac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \quad (1)$
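For illustration, the F1 score can be computed with scikit-learn; the labels below and the macro averaging over digit classes are our assumptions.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: World Model predictions vs. true digit classes.
y_true = [3, 1, 4, 1, 5, 9, 2, 6]
y_pred = [3, 1, 4, 2, 5, 9, 2, 5]
print(f1_score(y_true, y_pred, average="macro"))  # macro-average over classes
```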
Structural Similarity Index Measure. SSIM measures the structural similarity between any two given images. SSIM values range from 0 to 1: values near 0 indicate that the two images are not structurally similar, while values near 1 indicate that they are. We compute the SSIM between the test image and the agent-recreated test image. The SSIM metric complements the World Model classification metrics by quantitatively expressing how well the agent captures the semantic and structural information.
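A small usage sketch with scikit-image's structural_similarity; the image pair is synthetic and for illustration only.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Hypothetical pair: sender-perceived image vs. receiver-reconstructed image in [0, 1].
original = np.random.rand(28, 28)
reconstructed = np.clip(original + 0.05 * np.random.rand(28, 28), 0.0, 1.0)

score = structural_similarity(original, reconstructed, data_range=1.0)
print(f"SSIM: {score:.2f}")  # close to 1 for structurally similar images
```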
Table 1: We measure the World Model F1 score and the SSIM on the receiver agent reconstructed images under different
communication pathways. If there is no communication between agents or when the agents share messages alone, agents
cannot reconstruct an image using each other’s symbols. Thus, the reconstructed images have a low F1 score and SSIM.
When the receiver agent learns to reconstruct a sender-perceived image, the reconstructed images achieve an F1 score of circa
0.86 and an SSIM of 0.60.
Communication\Metrics F1 Score SSIM
No Communication 0.10 0.09
Shared Reconstruction 0.87 0.60
Shared Message 0.11 0.09
Shared Reconstruction and Message 0.86 0.60
Table 2: We measured the Levenshtein distance between two agents under different communication pathways. When there is no communication between the agents or when the agents share only messages, the agents do not develop a shared language, as the Levenshtein distance between the sender and receiver agents is greater than 2 for almost all the test images. When the sender and receiver agents learn to reconstruct the images using each other's symbols, they develop a shared language, as the distance between the symbols is at most 2 for the vast majority of the test images.
Communication\Edit Distance 0 1 2 3 4
No Communication 0 0 2 8 90
Shared Reconstruction 31 36 29 3 1
Shared Message 0 1 2 13 84
Shared Reconstruction and Message 28 43 23 6 0
Figure 6: Left: Agent 1's shared reconstruction loss measures the binary cross entropy between the image perceived by Agent 2 and the image Agent 1 reconstructs from the symbols. Right: the shared message loss measures the mean squared error between the two agents' representations of the digits in the image. Throughout training, the shared reconstruction and shared message losses increase even when the agents optimize for generating a similar semantic mapping. The loss of agents that share the symbols alone is comparable to that of agents that do not communicate at all. Both the shared reconstruction loss and the shared message loss converge when we train each agent to reconstruct the image using the other agent's symbols.
Levenshtein Distance Metric. The Levenshtein distance metric (Levenshtein et al., 1966; Hyyrö, 2001) measures the least number of insert, delete, and substitute operations required to change one string into another. The Levenshtein metric allows us to measure how far apart the languages developed by each agent in a multi-agent communication framework are.
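A compact dynamic-programming sketch of the metric; the example strings are hypothetical 12-bit messages.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Example: distance between two hypothetical 12-bit symbol sequences.
print(levenshtein("010110101100", "010100101101"))  # -> 2
```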
4.3 Communication Pathways
To answer our research questions, we created four experiments, each studying the learning behavior of the agents under one of the following four scenarios. For
each scenario, we measure the shared reconstruction
loss as the binary cross entropy loss between the
receiver-generated image and the perceived image of the sender agent.
Figure 7: Left: Agent 1's reconstruction loss measures the binary cross entropy between its perceived image and the image it reconstructs from the symbols. Right: the same for Agent 2. We measured the reconstruction loss for each agent under different communication pathways. Our results show that the internal perception of the agents remains consistent throughout training.
Table 3: Mean loss over the last 50 epochs for the different communication pathways, repeated for ten random iterations. Overall, the shared reconstruction loss and the shared message loss show little deviation. The results suggest that the language emergence in the first communication pathway and the lack of language emergence in the second are not due to random variations in the training procedure.
Pathways\Metrics Agent 1 Shared Reconstruction Loss Shared Message Loss
No Communication 137019.29 +/- 9275.42 0.42 +/- 0.01
Shared Reconstruction 11159.43 +/- 63.15 0.05 +/- 0.00
Shared Message 138954.43 +/- 15739.21 0.43 +/- 0.03
Shared Reconstruction and Message 11538.64 +/- 0.0 0.08 +/- 0.00
Table 4: We conducted ablation studies on different architecture choices for the agents. We report the World Model F1 score
and the SSIM on the agent-reconstructed images on the test set. The transformer-based autoencoder architecture outperformed
the convolution-based and fully connected network-based auto-encoder architecture in all metrics.
Metric\Architecture Convolution Fully Connected Transformer
F1 Score 0.78 0.83 0.88
SSIM 0.40 0.46 0.61
Sequence Length 30 30 12
Number of Symbols 10 2 2
We measure the shared message loss as the mean squared error between the first and the second agent's symbols.
1. No Communication. We train the agents to reconstruct images using their own language. This is a classic encoder-decoder setting with a binary bottleneck layer.
2. Shared Reconstruction. We train the agents to re-
construct images using their language. Then, each
agent reconstructs the image using the symbols of
another agent. The objective function minimizes
the shared reconstruction loss.
3. Shared Message. We train the agents to recon-
struct the images using their language. The objec-
tive function minimizes the shared language loss.
4. Shared Reconstruction and Shared Message. We
train the agents to reconstruct images using their
language. Then, each agent reconstructs the im-
age using the symbols of another agent. The ob-
jective function minimizes the shared reconstruc-
tion loss and shared message loss.
As shown in Figure 6, the shared language loss
goes down if and only if the shared reconstruction loss
goes down. As shown in Table 1, the agent-generated
images perform worse on the test set if we do not optimize for the shared reconstruction loss. The World Model F1 score is circa 0.86, and the SSIM is 0.60, only when the model is optimized with the shared reconstruction loss. As shown in Table 2, the agents generate similar symbols if the receiver agent can reconstruct the sender's images. To answer our first research question: yes, the receiver agent can successfully reconstruct an image perceived by the sender, and a shared language emerges, only when the agents learn the symbols with respect to the sender's image.
Even optimizing specifically for shared language
does not result in language emergence and follows the
loss trajectory of agents that do not communicate. A
similar effect is observed in the test set as shown in
Table 2: no similar symbols emerge even when optimizing for shared messages. To answer our second research question: no, agents sharing only the symbols do not develop a shared language. Without intention, a language cannot emerge between two agents. Optimizing the agents with respect to the symbols while ignoring what those symbols represent results in poor language learning.
Moreover, each agent has an internal perception
of the world. As shown in Figure 7, we measured
the reconstruction loss for each agent under different
communication pathways. Our results show that the
internal perception of the agents remains consistent
throughout training.
To gain better confidence in the results, we repeated the experiment for ten random iterations for each communication pathway; the loss measures are shown in Table 3. There is less variation when there is no communication between the agents and when the agents share only the symbols.
4.3.1 Ablation Studies
We created a fully connected network-based and
convolution-based auto-encoder to benchmark
against our transformer-based agent architecture. The fully connected network-based agent architecture is shown in Tables 5 and 6 in the Appendix. The convolution-based agent architecture is shown in Tables 7 and 8 in the Appendix.
We increased the number of symbols and the sequence length because a lower number of symbols results in poor representations for the fully connected and convolution-based agents. As shown in Table 4, the transformer-based network learned better symbolic representations and achieves a higher F1 score and SSIM.
5 CONCLUSIONS
In this paper, we have introduced a framework for
studying the communication between transformer-
based deep learning agents. In our experiments, we
could observe the emergence of a shared language
between the agents if the receiver successfully recon-
structs the image using the message from the sender
agent. We have shown that two agents can develop a
shared language within a Levenshtein distance of 2.
The reconstructed images achieve a 0.87 F1-score ob-
tained from an independent World Model. We found that agents focusing only on increasing the similarity of their symbols do not achieve the emergence of language.
In the current stage, we see several limitations
with our studies, which we will aim to address in fu-
ture work: First, we have only studied the framework
concerning agents cooperating to develop a shared
language, neglecting other potential relations between
models. Next, we assume agents to have a fixed vo-
cabulary throughout the training and test times. Nat-
ural language is variable in sequence length. Finally,
we experimented with only one pair of agents. An av-
enue for future work would be to expand the frame-
work to many agents interacting with each other.
REFERENCES
Batali, J. (1998). Computational simulations of the emergence of grammar. Approaches to the Evolution of Language: Social and Cognitive Bases.
Brandizzi, N. (2023). Towards More Human-like AI Com-
munication: A Review of Emergent Communication
Research. arXiv e-prints, page arXiv:2308.02541.
Chaabouni, R., Kharitonov, E., Lazaric, A., Dupoux, E., and
Baroni, M. (2019). Word-order biases in deep-agent
emergent communication.
Davis, J. and Goadrich, M. (2006). The relationship be-
tween precision-recall and roc curves. In Proceed-
ings of the 23rd international conference on Machine
learning, pages 233–240.
Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J.,
Guestrin, C., Liang, P., and Hashimoto, T. B. (2023).
Alpacafarm: A simulation framework for methods
that learn from human feedback.
Evtimova, K., Drozdov, A., Kiela, D., and Cho, K. (2018).
Emergent communication in a multi-modal, multi-
step referential game.
Hyyrö, H. (2001). Explaining and extending the bit-parallel approximate string matching algorithm of Myers. Technical report, Citeseer.
Jang, E., Gu, S., and Poole, B. (2017). Categorical repa-
rameterization with gumbel-softmax.
Jinxin, S., Jiabao, Z., Yilei, W., Xingjiao, W., Jiawen, L.,
and Liang, H. (2023). CGMI: Configurable General
Multi-Agent Interaction Framework. arXiv e-prints,
page arXiv:2308.12503.
Kottur, S., Moura, J. M., Lee, S., and Batra, D. (2017). Natural language does not emerge 'naturally' in multi-agent dialog. arXiv preprint arXiv:1706.08502.
Lazaridou, A. and Baroni, M. (2020). Emergent multi-
agent communication in the deep learning era. arXiv
preprint arXiv:2006.02419.
Levenshtein, V. I. et al. (1966). Binary codes capable of cor-
recting deletions, insertions, and reversals. In Soviet
physics doklady, volume 10, pages 707–710. Soviet
Union.
Li, F. and Bowling, M. (2019). Ease-of-teaching and lan-
guage structure from emergent communication.
Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., and
Ghanem, B. (2023). Camel: Communicative agents
for "mind" exploration of large scale language model
society.
Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang,
P., and Bernstein, M. S. (2023). Generative agents:
Interactive simulacra of human behavior.
Ren, Y., Guo, S., Labeau, M., Cohen, S. B., and Kirby, S.
(2020). Compositional languages emerge in a neural
iterated learning model.
Ri, R., Ueda, R., and Naradowsky, J. (2023). Emergent
communication with attention.
Rita, M., Strub, F., Grill, J.-B., Pietquin, O., and Dupoux,
E. (2022). On the role of population heterogeneity in
emergent communication.
Słowik, A., Gupta, A., Hamilton, W. L., Jamnik, M.,
Holden, S. B., and Pal, C. (2020). Structural Inductive
Biases in Emergent Communication. arXiv e-prints,
page arXiv:2002.01335.
Tishby, N., Pereira, F. C., and Bialek, W. (2000). The infor-
mation bottleneck method.
Turpin, M., Michael, J., Perez, E., and Bowman, S. R.
(2023). Language models don’t always say what they
think: Unfaithful explanations in chain-of-thought
prompting. arXiv preprint arXiv:2305.04388.
Wang, Z., Simoncelli, E. P., and Bovik, A. C. (2003). Mul-
tiscale structural similarity for image quality assess-
ment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang,
L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H.,
White, R. W., Burger, D., and Wang, C. (2023). Au-
togen: Enabling next-gen llm applications via multi-
agent conversation.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E.,
Li, B., Jiang, L., Zhang, X., and Wang, C. (2023).
AutoGen: Enabling Next-Gen LLM Applications via
Multi-Agent Conversation Framework. arXiv e-prints,
page arXiv:2308.08155.
APPENDIX
Fully Connected Network-Based
Architecture
We created a simple fully connected network-based
auto-encoder architecture for the agents. As shown
in Table 5, the encoder agents receive an image with
a fixed image height and width. The encoder then
transforms the features into a series of fully connected
layers with a fixed hidden dimension size specified
in hidden dim. Finally, the encoder outputs the se-
quence of symbols via the Gumbel SoftMax layer.
As shown in Table 6, the decoder receives a se-
quence of symbols and outputs the image after pass-
ing through a series of fully connected layers with a
fixed hidden dimension of size hidden dim.
Convolution-Based Architecture
We created a convolution-based auto-encoder archi-
tecture for the agents. As shown in Table 7, the en-
coder agents receive an image with a fixed image
height and width. The first two convolutional layers
extract the features in the image using 2D convolu-
tion with a kernel size of 3. Next, we apply the Max-
Pool2D layer with a kernel size of 2. During training, we apply dropout to the output features with a probability of 0.25. We flatten the features before applying the fully connected layer. Next, we apply two
fully connected layers with an optional dropout layer.
Finally, the encoder outputs the sequence of symbols
via the Gumbel SoftMax layer.
As shown in Table 8, the decoder receives a sequence of symbols and outputs the image after passing it through a series of fully connected layers with hidden dimensions of size hidden dim and 9216.
Table 5: We created a fully connected network-based encoder to test various choices for the multi-agent communication
framework. ’-’ denotes that there are no trainable parameters in that layer. The height and width represent the image size,
and hidden dim is the hidden dimension size. latent dim and categorical dim represent the sequence length and the number of
symbols.
Layer Input Output
Linear image: height*width hidden dim
ReLU - -
Linear hidden dim hidden dim
ReLU - -
Linear hidden dim latent dim*categorical dim
Gumbel SoftMax latent dim*categorical dim language: sequence of symbols
Table 6: We created a fully connected network-based decoder to test various choices for the multi-agent communication
framework. '-' denotes that there are no trainable parameters in that layer. The height and width represent the image size, and
hidden dim is the hidden dimension size.
Layer Input Output
Linear language: sequence of symbols hidden dim
ReLU - -
Linear hidden dim hidden dim
ReLU - -
Linear hidden dim image: height*width
Table 7: We created a Convolution-based agent encoder architecture to test various choices for the multi-agent communication
framework. ’-’ denotes that there are no trainable parameters in that layer. The height and width represent the image size,
and hidden dim is the hidden dimension size. latent dim and categorical dim represent the sequence length and the number
of symbols.
Layer Input Output
Conv2D image: height*width 32
ReLU - -
Conv2D 32 64
ReLU - -
MaxPool2D - -
Dropout - -
Flatten - 9216
ReLU - -
Linear 9216 128
ReLU - -
Dropout - -
Linear 128 64
ReLU - -
Linear 64 latent dim*categorical dim
Gumbel SoftMax latent dim*categorical dim language: sequence of symbols
Table 8: We created the decoder of the convolution-based agent from fully connected layers to test various choices for the multi-agent communication framework. '-' denotes that there are no trainable parameters in that layer. The height and width represent the image size, and
hidden dim is the hidden dimension size. latent dim and categorical dim represent the sequence length and the number of
symbols.
Layer Input Output
Linear language: sequence of symbols hidden dim
ReLU - -
Linear hidden dim hidden dim
ReLU - -
Linear hidden dim 9216
ReLU - -
Linear 9216 image: height*width