Image Generation from a Hyper Scene Graph
with Trinomial Hyperedges
Ryosuke Miyake, Tetsu Matsukawa (https://orcid.org/0000-0002-8841-6304) and Einoshin Suzuki (https://orcid.org/0000-0001-7743-6177)
Graduate School and Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
Keywords:
Image Generation, Scene Graph, Hyper Graph, Generative Adversarial Network.
Abstract:
Generating realistic images is one of the important problems in the field of computer vision. In image genera-
tion tasks, generating images consistent with an input given by the user is called conditional image generation.
Due to the recent advances in generating high-quality images with Generative Adversarial Networks, many
conditional image generation models have been proposed, such as text-to-image, scene-graph-to-image, and
layout-to-image models. Among them, scene-graph-to-image models have the advantage of generating an
image for a complex situation according to the structure of a scene graph. However, existing scene-graph-to-
image models have difficulty in capturing positional relations among three or more objects since a scene graph
can only represent relations between two objects. In this paper, we propose a novel image generation model
which addresses this shortcoming by generating images from a hyper scene graph with trinomial edges. We
also use a layout-to-image model supplementally to generate higher-resolution images. Experimental validations
on the COCO-Stuff and Visual Genome datasets show that the proposed model generates images that are more
natural and more faithful to the user's input than a cutting-edge scene-graph-to-image model.
1 INTRODUCTION
Generating realistic images is one of the important
problems in the field of computer vision. Image
generation can be applied in various fields (Agnese
et al., 2020; Wu et al., 2017) such as medicine (Nie
et al., 2017; Ghorbani et al., 2020) and art (Elgam-
mal et al., 2017). In the field of art, it could be
useful for artists and graphic designers. In the fu-
ture, when higher-quality images can be generated,
image or video search engines can be replaced by al-
gorithms which generate customized content based on
user preferences (Johnson et al., 2018).
In image generation tasks, generating images con-
sistent with the input given by the user is called condi-
tional image generation. In recent years, advances in
research on Generative Adversarial Networks (GANs)
(Goodfellow et al., 2014) have improved the qual-
ity of generated images, and many conditional image
generation models have been proposed. Among them,
text-to-image models (Reed et al., 2016; Zhang et al.,
2017; Zhang et al., 2018; Odena et al., 2017) gener-
ate images conditioned on a text such as “the sheep
is on the grass”. These models have the advantage
of their simple input and their ease of application in
many fields. Meanwhile, they have the disadvantage
of their difficulty in generating images which repre-
sent complex situations with many objects and their
relations. This drawback is attributed to the difficulty
in generating a single feature vector which contains
all of the information in a long sentence.
A scene graph represents situations in a similar
way to a text (Johnson et al., 2015). It consists of
nodes representing objects such as “dog” and bino-
mial edges representing a relation between two ob-
jects such as "on". In scene-graph-to-image models,
each object or edge label in a scene graph is trans-
formed into a feature vector. In other words, text-to-
image models convert an entire sentence into a single
feature vector, while scene-graph-to-image models
generate a feature vector for each word in a scene
graph. Thus, encoding a scene graph into feature
vectors is easier, and this fact enables these models
to generate proper images for complex situations.
Though scene-graph-to-image models have such
an advantage, they have a shortcoming: the positional
relations among three or more objects tend to be inac-
curate (Figure1 (a)). An edge in a scene graph can
represent only relations between two objects, so it
has difficulty in capturing a positional relation among
three or more objects. This inaccurate object position-
ing also causes overlapping of positions of the gen-
erated objects, which causes another shortcoming of
generating unclear objects (Figure 1 (b)).
We address the former shortcoming by generat-
ing images from a hyper scene graph with trinomial
hyperedges, each of which represents the positional
relation among three objects. If the shortcoming in
the positional relations is alleviated with the hyper-
edges, the object overlapping is reduced, resulting in
sharper objects and images. Moreover, we use the
layout-to-image model Layout2img (He et al., 2021)
supplementally to generate higher-resolution images.
Figure 1: Example of the scene-graph-to-image model sg2im (Johnson et al., 2018) generating images which are not consistent with the input. The columns show the input scene graph, the layout, and the generated image; the layout in the second column is an intermediate product of sg2im. The scene graph in (a) has a path "window" (green bounding box) → "left of" → "man" (purple). However, in the generated layout, "window" (green) surrounds "man" (purple) and is not positioned to the left of "man" (purple). In the generated image in (b), the entire image is unclear, and thus it is difficult to confirm that the objects in the scene graph are actually generated.
2 RELATED WORKS
Image generation models fall into three categories: (i)
GAN (Goodfellow et al., 2014; Radford et al., 2015),
(ii) VAE (Kingma and Welling, 2013), and (iii) au-
toregressive models (Van Den Oord et al., 2016). (i)
A GAN consists of a generator $G$ and a discrimina-
tor $D$. The generator generates data $\hat{x}$ from noise $z$
sampled from a noise distribution $p_z$. The discriminator
outputs the probability that the given data is not the generated data $\hat{x}$ but
real data $x$ sampled from the training data. These two models
are trained to compete with each other. (ii) VAE con-
sists of an encoder and a decoder. The encoder takes
an image as input and outputs a latent variable. The
decoder aims to recover the original image from the
latent variable. The two networks are trained simul-
taneously. (iii) The autoregressive model generates
an image by sequentially generating the value of each
pixel conditioned on all previously generated pixels.
Among these three types of models, GANs are widely used in con-
ditional image generation due to the realistic
look of the generated images and their ease of applica-
tion to various models. In this paper, we also employ
a GAN as the generative model.
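As a concrete illustration of the adversarial training described above, the following is a minimal PyTorch sketch of one alternating update; the two-layer networks, the data dimensionality, and the learning rate are illustrative assumptions and not the architectures used later in this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch: generator G maps noise z ~ p_z to generated data x_hat;
# discriminator D outputs the probability that its input is real.
z_dim, x_dim = 16, 32                     # illustrative sizes
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(x_real):
    """One alternating update of D and G on a batch of real data."""
    batch = x_real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: real data -> 1, generated data -> 0.
    z = torch.randn(batch, z_dim)
    x_fake = G(z).detach()
    loss_d = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make D output 1 for generated data.
    z = torch.randn(batch, z_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.randn(8, x_dim)))   # toy batch of "real" data
```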
Major inputs for conditional image generation
models using GAN are (1) text (Reed et al., 2016;
Zhang et al., 2017; Zhang et al., 2018; Odena et al.,
2017), (2) layouts (Sun and Wu, 2019; He et al., 2021;
Hinz et al., 2019; Hinz et al., 2022), and (3) scene
graphs (Johnson et al., 2018; Li et al., 2019; Vo and
Sugimoto, 2020; Mittal et al., 2019). In the following,
we categorize each generative model by these three
kinds of inputs and explain them.
(1) Text-to-image models have been studied
extensively due to their simple input and their ease of
application in many fields. Reed
et al. extended cGAN (Mirza and Osindero, 2014)
to generate images which are aligned with the input
semantically by using GAN-INT-CLS (Reed et al.,
2016). Odena et al. proposed AC-GAN (Odena
et al., 2017), in which the discriminator solves a task
of identifying the class of a given image in addition
to the general discriminative problem, and tried to
generate images where each object can be identified.
These models can generate an image consistent with
the input for a simple situation which involves few ob-
jects and few relations. However, they have difficulty
in generating images representing complex situations
with many objects and relations between them.
(2) Layout-to-image models and (3) scene-graph-
to-image models were proposed to overcome the
shortcoming of text-to-image models (Johnson et al.,
2018; Hinz et al., 2019). (2) He et al. proposed a
layout-to-image model Layout2img (He et al., 2021).
This model generates consistent feature vectors for
each object and generates natural images. The layout-
to-image models can control the position of the gen-
erated object directly by the input. On the other hand,
they have difficulty in being applied to image gener-
ation from a text due to the dissimilarity of the struc-
tures between a text and a layout.
(3) Johnson et al. proposed a scene-graph-to-
image model sg2im (Johnson et al., 2018). This
model processes a scene graph using a graph convo-
lutional network. Compared with the text-to-image
model StackGAN (Zhang et al., 2017), sg2im gener-
ates images which are semantically more consistent
with the input. Scene-graph-to-image models pos-
sess two main advantages, i.e., the ease of convert-
ing the input to feature vectors and the ease of ap-
plying them to text-to-image models. The former is
because a text-to-
image model converts an entire sentence into a feature
vector, while a scene-graph-to-image model converts
each word into a feature vector. The latter is attributed
to the similarity of the structures between a text and a
scene graph. Actually, there is research on the trans-
formation from a text to a scene graph (Schuster et al.,
2015). On the other hand, as explained in Section 1,
the scene-graph-to-image model has difficulty in cap-
turing the positional relations among three or more
objects correctly.
In this paper, we focus on scene-graph-to-image
models due to their advantages and aim to overcome
their shortcoming. In addition, we use the layout-to-
image model as a supplement to generate higher res-
olution images.
3 TARGET PROBLEM
To make the distinction between Johnson et al.'s work
and ours clear, first, we explain their target problem
(Johnson et al., 2018): image generation from a scene
graph. Then, we describe our target problem: image
generation from a hyper scene graph, and the evalua-
tion metrics used in this paper.
3.1 Image Generation from a Scene Graph

Johnson et al. set the target problem to creating a generator $G(S, z)$ which generates an image $\hat{I}$ from a scene graph $S = (V, E)$ and Gaussian noise $z$. Each object $v_i \in V$ has a category such as "dog" and "sky". $V$ denotes the set of nodes in the scene graph, where $V = \{v_1, \ldots, v_n\}$, and $n$ represents the number of nodes. A node represents an object. Category $c_i \in C$ of object $v_i$ is denoted by $c_i = f(v_i)$, where $C$ is the set of all categories of objects and $f(\cdot)$ is a mapping from object $v \in V$ to category $c \in C$. $E$ denotes the set of binomial edges in a scene graph, satisfying $E \subseteq V \times R_2 \times V$, where $R_2$ is the entire set of labels for binomial relations ("on", "left of", etc.). Note that for $(v_i, r_j, v_k) \in E$, $i \neq k$. A binomial edge is directed: $(v_i, r_j, v_k)$ and $(v_k, r_j, v_i)$ are distinct. Figure 2 (a) shows an example of a scene graph.
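As an illustration of this notation, a scene graph $S = (V, E)$ can be encoded with plain Python containers; the categories and relation labels below are those of Figure 2 (a), while the tuple-of-indices encoding itself is an illustrative choice rather than the data format of sg2im.

```python
# Illustrative encoding of the scene graph of Figure 2 (a); indices are 0-based.
# Nodes v_1..v_4 with their categories c_i = f(v_i).
objects = ["cat", "dog", "sheep", "grass"]          # f(v_1), ..., f(v_4)

# Directed binomial edges (v_i, r_j, v_k) in E, stored as (i, label, k):
edges = [
    (0, "left of", 1),   # "cat"   left of "dog"
    (1, "left of", 2),   # "dog"   left of "sheep"
    (2, "on",      3),   # "sheep" on "grass"
]

scene_graph = {"objects": objects, "edges": edges}
```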
3.2 Image Generation from a Hyper Scene Graph

We define the hyper scene graph as a scene graph with additional hyperedges, each of which represents a relation among three or more objects. In this paper, we focus on relations among three objects for simplicity. As described in Section 1, sg2im (Johnson et al., 2018) has a shortcoming: the positional relations among three or more objects in the generated image tend to be inaccurate. To address this shortcoming, we set our target problem to creating a generator $G(H, z)$ which generates an image $\hat{I}$ from a hyper scene graph $H = (V, E, Q)$ and Gaussian noise $z$. $Q$ denotes the set of trinomial hyperedges in $H$, which satisfies $Q \subseteq V \times R_3 \times V \times V$. $R_3$ denotes the entire set of labels (such as "between") for trinomial relations. A trinomial hyperedge $(v_i, r_j, v_k, v_l) \in Q$ satisfies $i \neq k$, $i \neq l$, $k \neq l$ and is directed: $(v_i, r_j, v_k, v_l)$, $(v_i, r_j, v_l, v_k)$, $(v_l, r_j, v_i, v_k)$, $(v_l, r_j, v_k, v_i)$, $(v_k, r_j, v_i, v_l)$, and $(v_k, r_j, v_l, v_i)$ are distinct. The definitions of $V$ and $E$ are the same as in Section 3.1. Figure 2 (b) shows an example of a hyper scene graph.
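Continuing the illustrative encoding from Section 3.1, the hyper scene graph $H = (V, E, Q)$ of Figure 2 (b) only adds the set $Q$ of trinomial hyperedges; the field names are again illustrative.

```python
# Illustrative encoding of the hyper scene graph H = (V, E, Q) of Figure 2 (b):
# same objects, the binomial edge "sheep on grass", and one directed
# trinomial hyperedge (v_i, r_j, v_k, v_l) stored as (i, label, k, l).
hyper_scene_graph = {
    "objects": ["cat", "dog", "sheep", "grass"],
    "edges": [(2, "on", 3)],                      # E_b
    "hyperedges": [(0, "between", 1, 2)],         # Q_b: cat, dog, sheep left to right
}
```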
Figure 2: Example of a scene graph (a) and a hyper scene graph (b). The red boxes and blue boxes represent objects and edge labels, respectively. The white arrows and green arrows represent binomial edges and trinomial hyperedges, respectively. Set $V$ of objects and the categories $f(V)$ of the objects are the same in both (a) and (b), and they are given by $V = \{v_1, v_2, v_3, v_4\}$ and $f(V) = \{$"cat", "dog", "sheep", "grass"$\}$. Sets $E_a$ and $E_b$ of binomial edges are given by $E_a = \{(v_1, r_1(\text{"left of"}), v_2), (v_2, r_2(\text{"left of"}), v_3), (v_3, r_3(\text{"on"}), v_4)\}$ and $E_b = \{(v_3, r_3(\text{"on"}), v_4)\}$, respectively. Set $Q_b$ of trinomial hyperedges is given by $Q_b = \{(v_1, r_4(\text{"between"}), v_2, v_3)\}$. The path "cat" → "left of" → "dog" corresponds to the binomial edge $(v_1(\text{"cat"}), \text{"left of"}, v_2(\text{"dog"}))$ and means that a cat exists to the left of a dog. Also, the path "cat" → "between" → "dog", "sheep" corresponds to the trinomial hyperedge $(v_1(\text{"cat"}), \text{"between"}, v_2(\text{"dog"}), v_3(\text{"sheep"}))$ and means that, from left to right, a cat, a dog, and a sheep align in a row.
3.3 Evaluation Metrics
There are various evaluation measures for conditional
image generation models, such as the fidelity and the
diversity of the generated images to the input, the
clarity of the boundaries, and the robustness to small
changes in the input (Frolov et al., 2021). In this pa-
per, we aim to achieve the following three goals.
A). Improving the positional relation among three ob-
jects connected to a hyperedge.
B). Making the overlapping of the objects small.
C). Making the generated image natural.
As the evaluation metrics for (A)-(C), we use
the Positional relation of Three Objects (PTO), the Area of
Overlapping (AoO), and the Inception Score (Salimans
et al., 2016), respectively. PTO and AoO are new
measures proposed in this paper. As an evaluation
measure for (C), we could also use the Fréchet In-
ception Distance (FID) (Heusel et al., 2017), which
is the distance between the distributions of the embed-
ded representations of the real images and of the gener-
ated images. However, we use the Inception Score in line
with (Johnson et al., 2018) in this paper.
PTO is the percentage of correctly generating the three bounding boxes $b_i$, $b_j$, $b_k$ for objects $v_i$, $v_j$, $v_k$ satisfying $(v_i, \text{"between"}, v_j, v_k) \in Q$. We judge that bounding boxes $b_i$, $b_j$, $b_k$ are correctly generated if they satisfy the following five conditions for the addition of a trinomial hyperedge "between".

- $v_i$, $v_j$, $v_k$ are lined up from left to right in this order, and the objects do not overlap, i.e., $x_{i0} < x_{i1} < x_{j0} < x_{j1} < x_{k0} < x_{k1}$.
- $v_i$, $v_j$, $v_k$ are not large objects such as the background, i.e., $x_{i1} - x_{i0} < 0.7w$, $x_{j1} - x_{j0} < 0.7w$, and $x_{k1} - x_{k0} < 0.7w$, where $w$ is the image width.
- $v_i$, $v_j$, $v_k$ are of nearly equal size, i.e., $\frac{1}{2} < \frac{x_{i1} - x_{i0}}{x_{j1} - x_{j0}} < 2$, $\frac{1}{2} < \frac{y_{i1} - y_{i0}}{y_{j1} - y_{j0}} < 2$, $\frac{1}{2} < \frac{x_{j1} - x_{j0}}{x_{k1} - x_{k0}} < 2$, and $\frac{1}{2} < \frac{y_{j1} - y_{j0}}{y_{k1} - y_{k0}} < 2$.
- $v_i$, $v_j$, $v_k$ are not largely apart horizontally, i.e., $0.5\max(x_{i1} - x_{i0}, x_{j1} - x_{j0}) > x_{j0} - x_{i1}$ and $0.5\max(x_{j1} - x_{j0}, x_{k1} - x_{k0}) > x_{k0} - x_{j1}$.
- $v_i$, $v_j$, $v_k$ are not largely apart vertically, i.e., $0.7\max(y_{i1} - y_{i0}, y_{j1} - y_{j0}) > y_{j0} - y_{i1}$ and $0.7\max(y_{j1} - y_{j0}, y_{k1} - y_{k0}) > y_{k0} - y_{j1}$.

Here, $b_i = (x_{i0}, x_{i1}, y_{i0}, y_{i1})$ with $x_{i0} < x_{i1}$ and $y_{i0} < y_{i1}$, which means that the bounding box is a rectangle with vertices $(x_{i0}, y_{i0})$, $(x_{i0}, y_{i1})$, $(x_{i1}, y_{i0})$, and $(x_{i1}, y_{i1})$. PTO is defined by

$$\mathrm{PTO} = \frac{\text{the number of correctly generated sets}}{\text{the number of the evaluated sets}}. \quad (1)$$

For sg2im (Johnson et al., 2018), which employs no hyperedges, we instead evaluate the bounding boxes of objects $(v_i, v_j, v_k)$ that satisfy $(v_i, \text{"left of"}, v_j) \in E$ and $(v_j, \text{"left of"}, v_k) \in E$.
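The five conditions and Equation (1) translate directly into the following sketch; the box representation $b = (x_0, x_1, y_0, y_1)$ follows the definition above, while the function names are illustrative.

```python
def between_is_correct(bi, bj, bk, w):
    """Check the five PTO conditions for boxes b = (x0, x1, y0, y1) of v_i, v_j, v_k."""
    (xi0, xi1, yi0, yi1), (xj0, xj1, yj0, yj1), (xk0, xk1, yk0, yk1) = bi, bj, bk
    # 1. Lined up left to right without horizontal overlap.
    if not (xi0 < xi1 < xj0 < xj1 < xk0 < xk1):
        return False
    # 2. None of the three objects is background-sized (width < 0.7 * image width).
    if not (xi1 - xi0 < 0.7 * w and xj1 - xj0 < 0.7 * w and xk1 - xk0 < 0.7 * w):
        return False
    # 3. Nearly equal sizes (width and height ratios between 1/2 and 2).
    ratios = [(xi1 - xi0) / (xj1 - xj0), (yi1 - yi0) / (yj1 - yj0),
              (xj1 - xj0) / (xk1 - xk0), (yj1 - yj0) / (yk1 - yk0)]
    if not all(0.5 < r < 2 for r in ratios):
        return False
    # 4. Not largely apart horizontally.
    if not (0.5 * max(xi1 - xi0, xj1 - xj0) > xj0 - xi1 and
            0.5 * max(xj1 - xj0, xk1 - xk0) > xk0 - xj1):
        return False
    # 5. Not largely apart vertically.
    if not (0.7 * max(yi1 - yi0, yj1 - yj0) > yj0 - yi1 and
            0.7 * max(yj1 - yj0, yk1 - yk0) > yk0 - yj1):
        return False
    return True

def pto(triplets, w):
    """Equation (1): fraction of evaluated (b_i, b_j, b_k) sets that are correct."""
    correct = sum(between_is_correct(bi, bj, bk, w) for bi, bj, bk in triplets)
    return correct / len(triplets) if triplets else 0.0
```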
AoO examines whether the introduction of the trinomial hyperedge reduces the overlapping of objects. AoO measures the overlapping of the three objects connected to a trinomial hyperedge and is defined as follows:

$$\mathrm{AoO}(b_i, b_j, b_k) = \mathrm{IoU}(b_i, b_j) + \mathrm{IoU}(b_i, b_k) + \mathrm{IoU}(b_j, b_k), \quad (2)$$

where IoU (Intersection over Union) (Rezatofighi et al., 2019) is a measure for evaluating the overlapping of two bounding boxes. IoU has been used to evaluate the overlapping between generated boxes and the ground-truth boxes in a dataset. However, our objective is to improve the relative positions of the generated bounding boxes, as we explained in Section 1. Thus, we use IoU here to evaluate the overlapping of the three objects. Let the two bounding boxes be $X$ and $Y$, and let $S(\cdot)$ be the area of the input region. Then, IoU is defined as follows:

$$\mathrm{IoU}(X, Y) = \frac{S(X \cap Y)}{S(X \cup Y)}. \quad (3)$$

The smaller AoO is, the smaller the overlapping between the objects is, which indicates that the objects are generated with appropriate positioning. In sg2im, we evaluate the bounding boxes of objects $v_i$, $v_j$, $v_k$ that satisfy $(v_i, \text{"left of"}, v_j) \in E$ and $(v_j, \text{"left of"}, v_k) \in E$.
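Equations (2) and (3) can be computed with the following sketch, using the same box representation as in the PTO check above; the function names are illustrative.

```python
def iou(box_x, box_y):
    """Equation (3): intersection area over union area of two boxes (x0, x1, y0, y1)."""
    x0, x1, y0, y1 = box_x
    u0, u1, v0, v1 = box_y
    inter_w = max(0.0, min(x1, u1) - max(x0, u0))
    inter_h = max(0.0, min(y1, v1) - max(y0, v0))
    inter = inter_w * inter_h
    union = (x1 - x0) * (y1 - y0) + (u1 - u0) * (v1 - v0) - inter
    return inter / union if union > 0 else 0.0

def aoo(bi, bj, bk):
    """Equation (2): sum of pairwise IoUs of the three boxes of a 'between' triplet."""
    return iou(bi, bj) + iou(bi, bk) + iou(bj, bk)
```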
We use the Inception Score (IS) (Salimans et al., 2016) as a measure for evaluating the naturalness of the generated images. IS is calculated using an Inception Network trained on ImageNet (Russakovsky et al., 2015; Szegedy et al., 2015) with the following equations:

$$\mathrm{IS} = \exp\left[\mathbb{E}_{\hat{I}}\left[D_{\mathrm{KL}}\left(p(y \mid \hat{I}) \,\|\, p(y)\right)\right]\right], \quad (4)$$

$$p(y) = \frac{1}{N} \sum_{i=1}^{N} p(y \mid \hat{I}_i), \quad (5)$$

where $p(y \mid \hat{I})$ is the probability of label $y$ of input image $\hat{I}$ predicted by the Inception Network and $p(y)$ is its marginal probability. The score becomes larger when the labels of the generated images are easily identified and the identified labels are diverse. Therefore, we can use the Inception Score as an evaluation measure for the naturalness of the generated images.
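Given the class probabilities $p(y \mid \hat{I}_i)$ predicted by the Inception Network for $N$ generated images, Equations (4) and (5) can be computed as in the following NumPy sketch; obtaining the probabilities from a pretrained classifier is outside the scope of the sketch, and the toy values shown are purely illustrative.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) array; each row is p(y | I_hat_i) and sums to 1."""
    p_y = probs.mean(axis=0, keepdims=True)       # Eq. (5): marginal p(y)
    # Eq. (4): exp of the average KL divergence D_KL(p(y | I_hat_i) || p(y)).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy usage: 4 "images", 3 classes (illustrative values only).
probs = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90],
                  [0.90, 0.05, 0.05]])
print(inception_score(probs))
```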
4 ORIGINAL MODEL
We design our model based on sg2im (Johnson et al., 2018), an image generation model from a scene graph. In this Section, we explain sg2im (Johnson et al., 2018) and its shortcomings.
4.1 sg2im (Johnson et al., 2018)
First, we show an overview of generator $G$ of sg2im (Johnson et al., 2018) in Figure 3 (a). In the generator of sg2im, the hyper scene graph corresponds to a scene graph, the hypergraph convolutional network to a graph convolutional network, the layout to a feature map, and Layout2img to the Cascaded Refinement Network (Chen and Koltun, 2017). The image generation in the generation phase is performed as follows.

1. Each object $v \in V$ and each binomial relation label $r$ (with $(\cdot, r, \cdot) \in E$) in scene graph $S$ is transformed into an object vector $\nu$ and a relation vector $\rho$ by the object embedding network and the relation embedding network, respectively.
2. We apply graph convolution to the object vectors $\nu$ and relation vectors $\rho$ based on scene graph $S$ and obtain convoluted object vectors.
3. The box regression network is applied to the convoluted object vectors to predict bounding boxes $\hat{B}$, which represent the regions in which to generate each object.
4. We generate feature map $M$, which is an intermediate representation between a scene graph and an image region, by mapping the convoluted object vectors based on their bounding boxes $\hat{B}$.
5. Feature map $M$ with Gaussian noise $z$ is fed into the Cascaded Refinement Network (Chen and Koltun, 2017) to generate image $\hat{I}$.
The graph convolution in step 2 is performed to convert the feature vector of each object so that it takes the entire scene graph into account. Figure 4 shows the process flow of the first layer of the graph convolutional network when the scene graph in Figure 2 (a) is the input. net1 is a multi-layer perceptron applied to binomial relations; it takes as input the vectors $(\nu_i, \rho_j, \nu_k)$ corresponding to an edge $e = (v_i, r_j, v_k) \in E$ ($r_j \in R_2$) and outputs vectors $(\nu_{ij}, \rho_j, \nu_{kj})$. Then, pooling and dimensionality reduction by net2 are performed for each object vector, and the first layer of the graph convolution is completed. The graph convolutional network has five layers, and the output of each layer is used as the input to the next layer. By repeating this process, information on each object vector and relation vector is propagated along the edges.
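The following PyTorch sketch mirrors this description of one layer: net1 produces per-edge candidate vectors, the candidates of each object are pooled, and net2 reduces the pooled vector back to the object-vector dimension. The 128/512-dimensional sizes follow the Appendix, whereas the two-layer MLPs, the average pooling, and the zero vector for isolated objects are illustrative assumptions rather than the exact design of sg2im.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One layer in the style of sg2im: per-edge MLP (net1), pooling of the
    candidate vectors of each object, then reduction by net2.
    Sizes follow the Appendix (128-d object/relation vectors, 512-d candidates);
    the two-layer MLPs and average pooling are illustrative assumptions."""

    def __init__(self, d=128, h=512):
        super().__init__()
        self.net1 = nn.Sequential(nn.Linear(3 * d, h), nn.ReLU(),
                                  nn.Linear(h, 2 * h + d))   # -> (nu_ij, rho_j, nu_kj)
        self.net2 = nn.Sequential(nn.Linear(h, h), nn.ReLU(),
                                  nn.Linear(h, d))            # back to d dimensions
        self.d, self.h = d, h

    def forward(self, nu, rho, edges):
        # nu: (n, d) object vectors; rho: (m, d) relation vectors;
        # edges[j] = (i, k) means edge j connects v_i -> v_k.
        d, h = self.d, self.h
        candidates = [[] for _ in range(nu.size(0))]   # per-object candidate vectors
        new_rho = []
        for j, (i, k) in enumerate(edges):
            out = self.net1(torch.cat([nu[i], rho[j], nu[k]]))
            nu_ij, rho_j, nu_kj = out[:h], out[h:h + d], out[h + d:]
            candidates[i].append(nu_ij)
            candidates[k].append(nu_kj)
            new_rho.append(rho_j)
        # Average pooling per object (objects with no edge keep a zero candidate).
        pooled = torch.stack([torch.stack(c).mean(dim=0) if c else nu.new_zeros(h)
                              for c in candidates])
        return self.net2(pooled), torch.stack(new_rho)

# Toy usage on the scene graph of Figure 2 (a): 4 objects, 3 binomial edges.
layer = GraphConvLayer()
nu = torch.randn(4, 128)                 # object vectors
rho = torch.randn(3, 128)                # relation vectors ("left of", "left of", "on")
new_nu, new_rho = layer(nu, rho, [(0, 1), (1, 2), (2, 3)])
print(new_nu.shape, new_rho.shape)       # torch.Size([4, 128]) torch.Size([3, 128])
```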
Next, we describe the discriminators. In sg2im (Johnson et al., 2018), there are two discriminators: $D_{\mathrm{img}}$ and $D_{\mathrm{obj}}$. $D_{\mathrm{img}}$ takes a generated image or a training image as input and outputs the probability that the input is a generated image. $D_{\mathrm{obj}}$ takes as input a generated object in the generated image or an object in a dataset image and outputs the probability that the input is a generated object as well as the probability that its category is $c \in C$. By identifying the category of each object with $D_{\mathrm{obj}}$, we can learn the semantic consistency between the word and the image.
4.2 Shortcomings of sg2im (Johnson
et al., 2018)
As described in Section 1, sg2im (Johnson et al.,
2018) has a shortcoming: the positional relations
among three or more objects tend to be inaccurate.
The edges in a scene graph can represent only rela-
tions between two objects. Therefore, a multilayer
perceptron (MLP) in the graph convolution network is
applied to the features of three objects in two separate
steps. For example, for a path “cat” “left of”
“dog” “left of” “sheep” in the scene graph, the
convolution is performed on the two edges “cat”
“left of” “dog” and “dog” “left of” “sheep”.
This fact leads to the difficulty of sg2im in capturing
the positional relations among three or more objects.
Moreover, sg2im (Johnson et al., 2018) has diffi-
culty in generating the objects clearly due to the low
resolution (64×64) of the generated images when the
input scene graph involves many objects. We con-
firmed that simply increasing the resolution of the
generated images does not work well due to mode
collapse (Gui et al., 2021), a phenomenon in which
the generated images all become similar to each other
due to training failure.
5 PROPOSED MODEL
To address the shortcoming of the existing image generation model from a scene graph, we propose hsg2im, an image generation model from a hyper scene graph. In this Section, we explain the proposed model hsg2im and its training.
5.1 hsg2im
As explained in Section 1, sg2im (Johnson et al.,
2018) has a shortcoming: the positional relations
among three objects tend to be inaccurate due to the
convolution method.
We address this shortcoming by modifying two
parts of sg2im (Johnson et al., 2018): generating
images from a hyper scene graph with trinomial
hyperedges and extending the graph convolutional net-
work to a hypergraph convolutional network. Also, to
generate higher-resolution images, we use the layout-
to-image model Layout2img (He et al., 2021). Lay-
out2img generates images whose resolution is 128
×128. In this Section, we explain image generation
using the hypergraph convolution network and Lay-
out2img (He et al., 2021). An overview of the gener-
ator G of hsg2im is shown in Figure 3 (b). The dis-
criminator is unchanged from that of sg2im.
First, we create the hypergraph convolutional network by adding a multi-layer perceptron net3, which is applied to trinomial hyperedges, to the graph convolutional network of sg2im. net3 takes vectors $(\nu_i, \rho_j, \nu_k, \nu_l)$ as input and outputs vectors $(\nu_{ij}, \rho_j, \nu_{kj}, \nu_{lj})$ corresponding to a trinomial hyperedge $q = (v_i, r_j, v_k, v_l) \in Q$. The addition of net3 enables the convolution of a hyper scene graph with trinomial hyperedges. As a result, three objects are processed with one application of net3 and two applications of net1, and their positional relation is expected to be improved. Furthermore, the improved positional relation reduces the overlapping of objects and is expected to produce layouts that are more consistent with the input, which leads to generating better images. Figure 4 shows the application of the hypergraph convolutional network to the hyper scene graph in Figure 2 (b).
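Extending the layer sketched in Section 4.1 to trinomial hyperedges only requires the additional per-hyperedge MLP net3. The sketch below shows this step in isolation, with the input/output sizes of Figure 8 in the Appendix; the two-layer MLP is an illustrative assumption, and the returned candidates are meant to be pooled together with those produced by net1.

```python
import torch
import torch.nn as nn

d, h = 128, 512
# net3: (nu_i, rho_j, nu_k, nu_l) -> (nu_ij, rho_j, nu_kj, nu_lj);
# sizes follow the Appendix, the two-layer MLP is an illustrative assumption.
net3 = nn.Sequential(nn.Linear(4 * d, h), nn.ReLU(),
                     nn.Linear(h, 3 * h + d))

def hyperedge_candidates(nu, rho_q, hyperedges):
    """Per-hyperedge step of the hypergraph convolution.
    hyperedges[j] = (i, k, l) for q = (v_i, r_j, v_k, v_l) in Q;
    rho_q[j] is the embedded label of hyperedge j.
    Returns (object index, candidate vector) pairs to be pooled together with
    the net1 candidates of the binomial edges, and the updated relation vectors."""
    cands, new_rho = [], []
    for j, (i, k, l) in enumerate(hyperedges):
        out = net3(torch.cat([nu[i], rho_q[j], nu[k], nu[l]]))
        nu_ij = out[:h]
        rho_j = out[h:h + d]
        nu_kj = out[h + d:2 * h + d]
        nu_lj = out[2 * h + d:]
        cands += [(i, nu_ij), (k, nu_kj), (l, nu_lj)]
        new_rho.append(rho_j)
    return cands, new_rho

# Toy usage on the hyperedge of Figure 2 (b): (v_1, "between", v_2, v_3).
nu = torch.randn(4, d)
rho_q = torch.randn(1, d)
cands, new_rho = hyperedge_candidates(nu, rho_q, [(0, 1, 2)])
```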
Next, we describe image generation using Layout2img (He et al., 2021). To address the image-resolution shortcoming of sg2im, we use the pre-trained model Layout2img, which generates a 128×128 image from a layout $L = \{(c_i, b_i)\}_{i=1}^{N}$, as an auxiliary model. The categories $\{c_i\}_{i=1}^{N}$ are obtained from the input (hyper) scene graph. The bounding boxes $\hat{B} = \{b_i\}_{i=1}^{N}$ are generated in the framework of sg2im (Johnson et al., 2018) with the hypergraph convolutional network. Layout2img uses the structure of ResNet (He et al., 2016) in its generator, so that even a model which generates a high-resolution image (128 × 128) can be trained stably.
5.2 Training of hsg2im
The training of generator $G$ and discriminator $D$ is performed alternately. During the training phase, feature map $M$ is created from the bounding boxes $B$ in the dataset. Loss $L_G$ of generator $G$ is $L_G = \sum_{i=1}^{5} w_i L_i$, where $w_i$ is the weight of loss $L_i$, defined as follows.

- Loss with respect to image pixels: $L_1 = L_{\mathrm{pix}}$
- Loss with respect to the bounding boxes: $L_2 = L_{\mathrm{box}}$
- Adversarial loss from discriminator $D_{\mathrm{img}}$: $L_3 = L^{\mathrm{img}}_{\mathrm{GAN}}$
- Adversarial loss from discriminator $D_{\mathrm{obj}}$: $L_4 = L^{\mathrm{obj}}_{\mathrm{GAN}}$
- Auxiliary classifier loss from $D_{\mathrm{obj}}$: $L_5 = L^{\mathrm{obj}}_{\mathrm{AC}}$

The losses ($L_{\mathrm{pix}}$, $L_{\mathrm{box}}$, $L^{\mathrm{img}}_{\mathrm{GAN}}$, $L^{\mathrm{obj}}_{\mathrm{GAN}}$, $L^{\mathrm{obj}}_{\mathrm{AC}}$) and their weights are set the same as in (Johnson et al., 2018).
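The total generator loss is then a plain weighted sum, as the following sketch illustrates; the five loss terms are assumed to be precomputed scalar tensors, and the weight values shown are placeholders rather than the ones used in (Johnson et al., 2018).

```python
import torch

def generator_loss(l_pix, l_box, l_gan_img, l_gan_obj, l_ac_obj,
                   weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """L_G = sum_i w_i * L_i; the weights here are placeholders, not the
    values from (Johnson et al., 2018)."""
    losses = (l_pix, l_box, l_gan_img, l_gan_obj, l_ac_obj)
    return sum(w * l for w, l in zip(weights, losses))

# Toy usage with dummy scalar losses.
dummy = [torch.tensor(0.1 * i, requires_grad=True) for i in range(1, 6)]
total = generator_loss(*dummy)
total.backward()
```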
6 EXPERIMENTS
6.1 Dataset
We use the 2017 COCO-Stuff (Caesar et al., 2018) and Visual Genome (Krishna et al., 2017) datasets, to which we add trinomial hyperedges. The 2017 COCO-
Stuff dataset (COCO) (Caesar et al., 2018) consists
of images, objects, their bounding boxes, and seg-
mentation masks. There are 40,000 training images
and 5,000 validation images. Since the COCO dataset
does not contain data on relations between objects, we
construct a hyper scene graph by adding seven rela-
tions “left of”, “right of”, “above”, “below”, “inside”,
“surrounding”, and “between” based on the bounding
boxes. The “between” relation is added as a trinomial
hyperedge in hsg2im. Since there is no test set in the
COCO dataset, we divide the validation set into a new
validation set and a test set. As a result, the COCO
dataset contains 24,972 training images, 1,667 vali-
dation images, and 3,333 test images.
The Visual Genome (Krishna et al., 2017) version 1.4 (VG) dataset consists of 108,077 images annotated with scene graphs, i.e., objects, their bounding boxes, and the binomial relations between the objects. In the entire dataset, only objects
which appear more than 2,000 times and relations
which appear more than 500 times are used following
(Johnson et al., 2018) in our experiments. Samples
with fewer than 2 objects or more than 31 objects are
ignored. As a result, we use 62,602 training images,
5,069 evaluation images, and 5,110 test images. For
Visual Genome, we add the “between” relation (4,136
hyperedges) as in COCO.
In addition, to compare sg2im with our hsg2im, we also add, in both datasets, the binomial edges "left of" (8,272 edges) corresponding to the hyperedges "between": $(v_i, \text{"left of"}, v_j)$ and $(v_j, \text{"left of"}, v_k)$, which together correspond to $(v_i, \text{"between"}, v_j, v_k)$.
Next, we explain how we add the hyperedges "between" and the edges "left of". For hsg2im, both a trinomial hyperedge $(v_i, \text{"between"}, v_j, v_k)$ and two binomial edges $(v_i, \text{"left of"}, v_j)$ and $(v_j, \text{"left of"}, v_k)$ are added for three objects $(v_i, b_i)$, $(v_j, b_j)$, $(v_k, b_k)$ which satisfy all of the five conditions in Section 3.3. For sg2im, only the latter two edges are added. In Visual Genome, we add 3,477 hyperedges, 334 hyperedges, and 325 hyperedges for the training dataset, the validation dataset, and the test dataset, respectively. In COCO, we add 9,437 hyperedges, 144 hyperedges, and 227 hyperedges for the training dataset, the validation dataset, and the test dataset, respectively.
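A simplified sketch of this augmentation is given below. For brevity it checks only the first of the five conditions of Section 3.3 (left-to-right alignment without overlap); a full implementation would check all five, e.g., with a function like the between_is_correct sketch in Section 3.3. All function and variable names are illustrative.

```python
def add_between_annotations(boxes, edges, hyperedges):
    """Add a 'between' hyperedge and the two corresponding 'left of' edges for
    every object triplet (i, j, k) whose boxes b = (x0, x1, y0, y1) satisfy the
    conditions of Section 3.3. Only the left-to-right condition is checked here;
    the remaining four conditions are omitted for brevity."""
    n = len(boxes)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                (xi0, xi1, _, _), (xj0, xj1, _, _), (xk0, xk1, _, _) = \
                    boxes[i], boxes[j], boxes[k]
                if xi0 < xi1 < xj0 < xj1 < xk0 < xk1:   # condition 1 only
                    hyperedges.append((i, "between", j, k))
                    edges.append((i, "left of", j))
                    edges.append((j, "left of", k))
    return edges, hyperedges
```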
Figure 3: Example of generating an image without Layout2img (He et al., 2021) with generator G of sg2im (Johnson et al., 2018) (a), and with Layout2img with generator G of hsg2im (b). The elements that we modified are highlighted in yellow.
Figure 4: Flowchart of one of the five layers of the hypergraph convolutional network of hsg2im. The part without the yellow region shows the corresponding layer of the graph convolutional network of sg2im (Johnson et al., 2018). net1, net2, and net3 represent MLPs for binomial edges, dimensionality reduction, and trinomial hyperedges, respectively. The detailed structures of net1, net2, and net3 are given in the Appendix.
6.2 Experimental Conditions
We train sg2im and hsg2im on each of the COCO and Visual Genome datasets based on the method described in Section 5.2. Here, the learning algorithm (Adam (Kingma and Ba, 2015)), the learning rate ($10^{-4}$), the batch size (32), and the upper bound on the number of iterations ($10^6$) are the same as in the sg2im paper (Johnson et al., 2018).
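Expressed in PyTorch, this training configuration amounts to the following sketch; the parameter list is a placeholder for the actual generator and discriminator parameters.

```python
import torch

# Training configuration of this section (model definitions omitted).
params = [torch.nn.Parameter(torch.zeros(1))]    # placeholder for the model parameters
optimizer = torch.optim.Adam(params, lr=1e-4)
batch_size = 32
max_iterations = 10**6
```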
6.3 Results and Discussion
First, we quantitatively evaluate hsg2im. The evalua-
tion in terms of PTO is shown in Table 1. We see that
hsg2im shows 12% and 20% higher scores than sg2im
on the COCO and VG datasets, respectively. These
results are because the trinomial hyperedge enables
us to capture the positional relations among three ob-
jects.
We show the evaluation in terms of AoO in Table
2. hsg2im shows a better (i.e., lower) score on the VG dataset.
Table 1: Comparison of PTO scores between sg2im and
hsg2im.
COCO VG
sg2im 0.52 0.19
hsg2im 0.64 0.39
In the COCO dataset, the scores of sg2im and hsg2im
are the same. We speculate that the identical scores
on the COCO dataset are due to the low diversity
of binomial edges in this dataset; sg2im already
generated bounding boxes with little overlapping
using only binomial relations, so the room for
improvement is small. For VG, we can confirm that the addition of a
trinomial hyperedge improves the positional relation
of three objects and reduces the overlapping between
them.
Table 2: Comparison of AoO scores between sg2im and
hsg2im.
COCO VG
sg2im 0.02 0.11
hsg2im 0.02 0.06
We show the evaluation in terms of the Inception Score in Table 3. For the images generated without Layout2img, hsg2im shows a 0.33 lower score than sg2im on COCO and a 0.40 higher score on Visual Genome. For the images generated with Layout2img, hsg2im shows 0.26 and 0.10 higher scores than sg2im on COCO and Visual Genome, respectively. In the case of not using Layout2img on COCO, the degraded IS would be attributed to the low resolution of the generated images. Overall, the scores show that the images generated with Layout2img are more natural than those generated without it, and that more natural images are generated from the layouts produced by hsg2im than from those produced by sg2im.
Table 3: Comparison of Inception Score between sg2im and
hsg2im, and with and without Layout2img.
COCO VG
sg2im wo Layout2img 4.17 5.45
hsg2im wo Layout2img 3.84 5.85
sg2im + Layout2img 11.63 9.83
hsg2im + Layout2img 11.89 9.93
Next, we qualitatively evaluate the generated images. Figure 5 shows the images generated by hsg2im and sg2im. The scene graph of the input to sg2im in (a) has a path "sheep" (light blue bounding box) → "left of" → "sheep" (yellow) → "left of" → "sheep" (purple). However, we can confirm that "sheep" (yellow) is generated at a higher position than "sheep" (light blue), and the "left of" relation is not satisfied. On the other hand, in the layout generated by hsg2im, the three objects are generated in a positional relation which satisfies the trinomial relation "sheep" (light blue) → "between" → "sheep" (yellow), "sheep" (purple) in the input hyper scene graph. Also, the scene graph of the input to sg2im in (b) has a path "picture" (purple) → "left of" → "light" (light blue) → "left of" → "shelf" (blue). However, the "light" and "shelf" bounding boxes are generated on top of each other, and these objects are missing in the image (with Layout2img). On the other hand, in the layout generated by hsg2im, the bounding boxes for "light" and "shelf" do not overlap each other, and these objects can be seen in the generated image (with Layout2img). Thus, we can also confirm the effect of the introduction of the trinomial hyperedge in the generated images.
6.4 Computational Time
We explain the computational time of the training and
the test phases in each of sg2im and hsg2im. We
used one GPU (NVIDIA TITAN RTX). From Table
4, we can confirm that there is no significant difference between sg2im and hsg2im in either the training time or the test time. In some cases, such as the training time on Visual Genome and COCO, hsg2im takes a slightly longer time due to the addition of net3 in the proposed model.
Table 4: Computational time of the training and the test
phases in each of sg2im and hsg2im.
Training Test
sg2im (COCO) 66.1h 127.6m
hsg2im (COCO) 66.4h 127.5m
sg2im (VG) 31.8h 196.1m
hsg2im (VG) 34.8h 191.5m
7 CONCLUSIONS
We have proposed hsg2im, an image generation model from a hyper scene graph. The proposed model extends the scene-graph-to-image model sg2im (Johnson et al., 2018) by adding hyperedges representing trinomial relations to the scene graph, in order to improve the positional relations of the generated objects.
Figure 5: Comparison of images generated by sg2im (Johnson et al., 2018) and hsg2im. The columns show, from left to right, the input (hyper) scene graph, the layout and the images generated by sg2im (without and with Layout2img), and the layout and the images generated by hsg2im (without and with Layout2img). (a)-(d) and (e)-(g) are from the Visual Genome and the MS COCO datasets, respectively. In the input for sg2im, the trinomial hyperedges "between" do not exist.

In the evaluation of the Positional relation of Three Objects (PTO), the performance of hsg2im was higher than that of sg2im by about 12% and 20% on the 2017 COCO-Stuff (COCO) dataset and the Visual Genome dataset, respectively. In the evaluation of the naturalness of the generated images based on the Inception Score, the scores improved by about 0.25 for
the COCO dataset and by about 0.40 for the Visual
Genome dataset. In addition, we generated more nat-
ural images by using the layout-to-image model Lay-
out2img supplementally. Although we added only the "between" relation as a trinomial hyperedge, adding other kinds of hyperedges would lead to generating images that are more consistent with the input. Thus, this is an interesting direction for our future work.
REFERENCES
Agnese, J., Herrera, J., Tao, H., and Zhu, X. (2020). A Sur-
vey and Taxonomy of Adversarial Neural Networks
for Text-to-Image Synthesis. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery,
10(4):e1345.
Caesar, H., Uijlings, J., and Ferrari, V. (2018). COCO-Stuff:
Thing and Stuff Classes in Context. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1209–1218.
Chen, Q. and Koltun, V. (2017). Photographic Image Syn-
thesis with Cascaded Refinement Networks. In Pro-
ceedings of the IEEE International Conference on
Computer Vision, pages 1511–1520.
Elgammal, A., Liu, B., Elhoseiny, M., and Mazzone, M.
(2017). CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms. arXiv preprint arXiv:1706.07068.
Frolov, S., Hinz, T., Raue, F., Hees, J., and Dengel, A.
(2021). Adversarial Text-to-Image Synthesis: A re-
view. arXiv preprint arXiv:2101.09983.
Ghorbani, A., Natarajan, V., Coz, D., and Liu, Y. (2020).
DermGAN: Synthetic Generation of Clinical Skin Im-
ages with Pathology. In Machine Learning for Health
Workshop, pages 155–170. PMLR.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative Adversarial Nets. Advances
in Neural Information Processing Systems, 27.
Gui, J., Sun, Z., Wen, Y., Tao, D., and Ye, J. (2021). A
Review on Generative Adversarial Networks: Algo-
rithms, Theory, and Applications. IEEE Transactions
on Knowledge and Data Engineering.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Resid-
ual Learning for Image Recognition. In Proceedings
of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 770–778.
He, S., Liao, W., Yang, M. Y., Yang, Y., Song, Y.-Z., Rosen-
hahn, B., and Xiang, T. (2021). Context-Aware Layout
to Image Generation with Enhanced Object Appear-
ance. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
15049–15058.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). GANs Trained by a Two Time-
Scale Update Rule Converge to a Local Nash Equilib-
rium. In NeurIPS.
Hinz, T., Heinrich, S., and Wermter, S. (2019). Generat-
ing Multiple Objects at Spatially Distinct Locations.
ArXiv, abs/1901.00686.
Hinz, T., Heinrich, S., and Wermter, S. (2022). Semantic
Object Accuracy for Generative Text-to-Image Syn-
thesis. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 44:1552–1565.
Johnson, J., Gupta, A., and Fei-Fei, L. (2018). Image
Generation from Scene Graphs. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1219–1228.
Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D.,
Bernstein, M., and Fei-Fei, L. (2015). Image Retrieval
Using Scene Graphs. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 3668–3678.
Kingma, D. P. and Ba, J. (2015). Adam: A Method for
Stochastic Optimization. CoRR, abs/1412.6980.
Kingma, D. P. and Welling, M. (2013). Auto-Encoding
Variational Bayes. arXiv preprint arXiv:1312.6114.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata,
K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J.,
Shamma, D. A., et al. (2017). Visual Genome: Con-
necting Language and Vision Using Crowdsourced
Dense Image Annotations. International Journal of
Computer Vision, 123(1):32–73.
Li, Y., Ma, T., Bai, Y., Duan, N., Wei, S., and Wang,
X. (2019). PasteGAN: A Semi-Parametric Method
to Generate Image from Scene Graph. ArXiv,
abs/1905.01608.
Mirza, M. and Osindero, S. (2014). Conditional Generative
Adversarial Nets. arXiv preprint arXiv:1411.1784.
Mittal, G., Agrawal, S., Agarwal, A., Mehta, S., and Mar-
wah, T. (2019). Interactive Image Generation Using
Scene Graphs. ArXiv, abs/1905.03743.
Nie, D., Trullo, R., Lian, J., Petitjean, C., Ruan, S., Wang,
Q., and Shen, D. (2017). Medical Image Synthe-
sis with Context-Aware Generative Adversarial Net-
works. In International Conference on Medical Im-
age Computing and Computer-Assisted Intervention,
pages 417–425. Springer.
Odena, A., Olah, C., and Shlens, J. (2017). Conditional Im-
age Synthesis with Auxiliary Classifier GANs. In In-
ternational Conference on Machine Learning, pages
2642–2651. PMLR.
Radford, A., Metz, L., and Chintala, S. (2015). Un-
supervised Representation Learning with Deep Con-
volutional Generative Adversarial Networks. arXiv
preprint arXiv:1511.06434.
Reed, S. E., Akata, Z., Yan, X., Logeswaran, L., Schiele,
B., and Lee, H. (2016). Generative Adversarial Text
to Image Synthesis. ArXiv, abs/1605.05396.
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid,
I., and Savarese, S. (2019). Generalized Intersection
over Union: A Metric and a Loss for Bounding Box
Regression. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 658–666.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). ImageNet Large Scale Vi-
sual Recognition Challenge. International Journal of
Computer Vision, 115(3):211–252.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,
Radford, A., and Chen, X. (2016). Improved Tech-
niques for Training GANs. Advances in Neural Infor-
mation Processing Systems, 29.
Schuster, S., Krishna, R., Chang, A., Fei-Fei, L., and Man-
ning, C. D. (2015). Generating Semantically Precise
Scene Graphs from Textual Descriptions for Improved
Image Retrieval. In Proceedings of the Fourth Work-
shop on Vision and Language, pages 70–80.
Sun, W. and Wu, T. (2019). Image Synthesis From Recon-
figurable Layout and Style. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 10531–10540.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going Deeper with Convolutions.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–9.
Van Den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K.
(2016). Pixel Recurrent Neural Networks. In Interna-
tional Conference on Machine Learning, pages 1747–
1756. PMLR.
Vo, D. M. and Sugimoto, A. (2020). Visual-Relation Con-
scious Image Generation from Structured-Text. ArXiv,
abs/1908.01741.
Wu, X., Xu, K., and Hall, P. (2017). A Survey of Im-
age Synthesis and Editing with Generative Adver-
sarial Networks. Tsinghua Science and Technology,
22(6):660–674.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang,
X., and Metaxas, D. N. (2017). StackGAN: Text to
Photo-Realistic Image Synthesis with Stacked Gen-
erative Adversarial Networks. In Proceedings of the
IEEE International Conference on Computer Vision,
pages 5907–5915.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X.,
and Metaxas, D. N. (2018). StackGAN++: Realistic
Image Synthesis with Stacked Generative Adversarial
Networks. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 41(8):1947–1962.
APPENDIX
We show the detailed structure of hsg2im. Figures 6, 7, and 8 show the structures of net1, net2, and net3, respectively, which are used in the (hyper)graph convolutional network.
Figure 6: Structure of net1. It receives two 128-dimensional
object vectors and one relation vector corresponding to $e \in E$, and outputs two 512-dimensional object vectors and one
128-dimensional relation vector.
Figure 7: Structure of net2. It receives one 512-dimensional
object vector, which is the output of net1 and net3, and out-
puts a 128-dimensional object vector by conducting dimen-
sionality reduction.
Figure 8: Structure of net3. It receives three 128-
dimensional object vectors and one relation vector corresponding to $q \in Q$, and outputs three 512-dimensional ob-
ject vectors and one 128-dimensional relation vector.