Visualizing Language: Transformer-CNN Fusion for Generating Photorealistic Images

Danish N Mulla, Devaunsh Vastrad, Shahin Mirakhan, Atharva Patil, Uday Kulkarni

2025

Abstract

The task of generating meaningful images from text descriptions involves two major components: natural language understanding and visual synthesis. Recently, transformer-based models combined with convolutional neural networks (CNNs) have gained immense prominence in this domain. This paper presents a hybrid approach that pairs a Transformer-based text encoder with a CNN-based image feature extractor, InceptionV3, to create images corresponding to textual descriptions. The model takes the text as input and passes it through a Transformer encoder to capture contextual and semantic information. In parallel, high-level visual features are extracted by the CNN from the COCO dataset. The image decoder then decodes these features into synthesized images conditioned on the input text. Sparse categorical cross-entropy loss is employed to reduce the distance between generated and reference images during training, and data augmentation is used to enhance generalization. The results show a strong text-image alignment accuracy of 72 percent, a Fréchet Inception Distance of 18.2, and an Inception Score of 5.6. High-definition images were generated for prompts such as "A Policeman Riding a Motorcycle"; others showed diversity according to the prompts provided, for instance, "Assorted Electronic Devices" and "A Man Riding a Wave on Top of a Surfboard." A remaining challenge is generating surface textures from abstract descriptions, which can be tackled in subsequent work.
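The abstract describes the pipeline only at a high level; the sketch below is one plausible Keras realization of that pipeline, not the authors' released implementation. The vocabulary size, caption length, attention width, output resolution, and intensity binning are illustrative assumptions chosen so the model compiles with a sparse categorical cross-entropy objective as stated in the abstract.

```python
# Hypothetical sketch of a Transformer-CNN fusion model (assumed hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10_000   # assumed caption vocabulary size
MAX_LEN    = 32       # assumed maximum caption length (tokens)
D_MODEL    = 256      # embedding / attention width
OUT_SIZE   = 32       # assumed side length of the generated image (pixels)
NUM_BINS   = 256      # discretized pixel intensities treated as classes

# --- Text branch: embedding + one Transformer encoder block ----------------
text_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="caption_tokens")
x = layers.Embedding(VOCAB_SIZE, D_MODEL)(text_in)
attn = layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL // 4)(x, x)
x = layers.LayerNormalization()(x + attn)                  # residual + norm
ffn = layers.Dense(4 * D_MODEL, activation="relu")(x)
ffn = layers.Dense(D_MODEL)(ffn)
x = layers.LayerNormalization()(x + ffn)
text_vec = layers.GlobalAveragePooling1D()(x)              # pooled text context

# --- Image branch: frozen InceptionV3 feature extractor --------------------
img_in = layers.Input(shape=(299, 299, 3), name="reference_image")
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False
img_vec = backbone(img_in)                                  # (batch, 2048)

# --- Fusion + decoder: one intensity class per output pixel ----------------
fused = layers.Concatenate()([text_vec, img_vec])
h = layers.Dense(1024, activation="relu")(fused)
logits = layers.Dense(OUT_SIZE * OUT_SIZE * NUM_BINS)(h)
logits = layers.Reshape((OUT_SIZE, OUT_SIZE, NUM_BINS))(logits)

model = tf.keras.Model([text_in, img_in], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Targets are integer pixel intensities of shape (batch, OUT_SIZE, OUT_SIZE).
```

In this sketch the reference image is only used to supply high-level COCO features during training; any richer decoder (e.g., upsampling convolutions) could replace the dense head while keeping the same fused conditioning vector.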



Paper Citation


in Harvard Style

Mulla D., Vastrad D., Mirakhan S., Patil A. and Kulkarni U. (2025). Visualizing Language: Transformer-CNN Fusion for Generating Photorealistic Images. In Proceedings of the 3rd International Conference on Futuristic Technology - Volume 3: INCOFT; ISBN 978-989-758-763-4, SciTePress, pages 160-166. DOI: 10.5220/0013610700004664


in BibTeX Style

@conference{incoft25,
author={Danish Mulla and Devaunsh Vastrad and Shahin Mirakhan and Atharva Patil and Uday Kulkarni},
title={Visualizing Language: Transformer-CNN Fusion for Generating Photorealistic Images},
booktitle={Proceedings of the 3rd International Conference on Futuristic Technology - Volume 3: INCOFT},
year={2025},
pages={160-166},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013610700004664},
isbn={978-989-758-763-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 3rd International Conference on Futuristic Technology - Volume 3: INCOFT
TI - Visualizing Language: Transformer-CNN Fusion for Generating Photorealistic Images
SN - 978-989-758-763-4
AU - Mulla D.
AU - Vastrad D.
AU - Mirakhan S.
AU - Patil A.
AU - Kulkarni U.
PY - 2025
SP - 160
EP - 166
DO - 10.5220/0013610700004664
PB - SciTePress