Enhanced 3D Model Generation from 2D Object Images: A Deep
Learning Approach Integrating Shape and Texture Learning with
Real‑Time, Occlusion‑Aware and Object‑Agnostic Methods
A. Bhagyalakshmi
1
, S. Prabagar
2
, Guruprasad Konnurmath
3
, D. B. K. Kamesh
4
,
R. Vishalakshi
5
and Victoriya A.
6
1
Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and
Technology, Chennai, Tamil Nadu 600062, India
2
Department of Computer Science and Engineering, COE in IoT, Alliance School of Advanced Computing, Alliance
University, Karnataka, India
3
School of Computer Science and Engineering, K.L.E. Technological University, BVB Campus, Vidyanagar, Hubballi,
Karnataka, India
4
Department of Computer Science and Engineering, MLR Institute of Technology, Hyderabad‑500043, Telangana, India
5
Department of Computer Science and Engineering (Data Science), Vardhaman College of Engineering, Shamshabad,
Hyderabad, Telangana, India
6
Department of ECE, New Prince Shri Bhavani College of Engineering and Technology, Chennai, Tamil Nadu, India
Keywords: 3D Model Generation, Deep Learning, Shape, Texture Learning, Real‑Time Performance, Occlusion‑Aware,
Object‑Agnostic, Hybrid Learning, Temporal Consistency, Video‑Based Input, Platform‑Agnostic.
Abstract: The problem of generating 3D models from 2D object images has been very difficult because it is very
complex to capture real shape and texture. The work presents an improved deep learning method for 3D model
synthesis which addresses the shortcomings of existing approaches. Our approach embeds the shape and
texture learning process in a single framework allowing for real time performance and invariance towards 2D
occlusion. Unlike prior methods that need category-specific training, our weapon bidder is object-agnostic
and works for various objects other than human-centric ones. We also use effective priors and a novel hybrid
learning approach to massively reduce computational cost and improve model generalization. The outcome
is a fully scalable, real-time system that can handle high-quality decoding of 3D models from a single 2D
image, with potential applications to tasks in augmented reality, virtual reality, and product design. Our
method also enforces temporal coherence when applied to videos and is device-agnostic, enabling deployment
on edge-based inference devices. We demonstrate experimentally on a broad set of object categories and
image qualities that the proposed method is effective and scalable.
1 INTRODUCTION
Reconstruction of 3D models from 2D images is a
fundamental leap in computer vision, augmented
reality (AR), virtual reality (VR), and digital design.
Most of the conventional 3D model acquisition
systems are based on multi-view images or Depth
sensors, which are resource hungry and difficult for
real-time scenarios. Recently, deep learning
techniques have been developed that generate 3D
models given a single 2D image, yet such approaches
suffer from problems of texture detail, accurate shape
reconstruction, occlusions, and the necessity of a
large amount of labelled data.
This work fills up those gaps, and propose a novel
deep learning framework for 3D model generation
which takes both shapes learning and texture learning
into account in a joint manner. Our approach works
in real time and guarantees that occluded and
partially visible objects can be reconstructed with
high accuracy. Although the model is not limited to
any specific object class, it is very versatile, and it can
handle many different object types, from rigid objects
to complex human avatars.
Bhagyalakshmi, A., Prabagar, S., Konnurmath, G., Kamesh, D. B. K., Vishalakshi, R. and A., V.
Enhanced 3D Model Generation from 2D Object Images: A Deep Learning Approach Integrating Shape and Texture Learning with Real-Time, Occlusion-Aware and Object-Agnostic Methods.
DOI: 10.5220/0013857700004919
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies (ICRDICCT‘25 2025) - Volume 1, pages
71-77
ISBN: 978-989-758-777-1
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
71
Furthermore, our methods use efficient priors and
hybrid learning mechanisms to make the best use of
computation and at the same time achieve high
performance. Our method enables generalization
across different datasets by providing object-agnostic
capabilities and minimizing the reliance on category-
specific training. We extend the system by adding
temporal consistency ability for video-based inputs,
taking texture transitions and object motion in
dynamic scenes into account.
Our study not only enhances the quality and
consistency of 3D models recovered from 2D images
but also offers a practical and scalable solution, which
can be easily deployed on the edge side for real-time
applications. In the rest of this paper, we discuss the
methodology, experimental setup, and results of our
approach showing that it is an effective, versatile, and
potentially useful tool for various industries and
applications.
2 PROBLEM STATEMENT
One of the key challenges in computer vision and
graphics is the accurate reconstruction of 3D models
of an object from 2D (2 dimensional) images.
Methods currently suffer from problems as
occlusions, freely-object pose and texture mapping
invariance which are hard to deal with objects in
various categories. Existing methods mostly rely on
multi-view input, which are computationally
expensive, or are object-category-specific, which
limits their practical application. Worse still, most
existing methods cannot generalize well on other
datasets, which makes them ineffective to cope with
large-scale and diverse datasets.
Even with the progress of deep learning many 3D
generation models are based on hand-crafted priors,
which are not suitable for dynamic applications or
real-time usage. Further, the failure to deal with
occluded areas in images and the no real-time
performance render these works very cumbersome
for applications such as AR, VR, or product design
and anything which require speedily and accurately.
Hence, a fast-one-shot object-agnostic
reconstructing deep learning model which can
produce 3D models from a single 2D image such as
the occlusion is not addressed and create high-
definition textures and shapes in real time is in
demand. Overcoming these limitations would extend
the possibilities of 3D model generation and would
thus be more easily inducted to interactive and
immersive environments.
3 LITERATURE SURVEY
Deep learning has great impact on the paradigm of 2D
image reconstruction to 3D model recently. Early
incarnations have been biased towards multi-view
imagery processing, to which the current
development infers a more general accessibility and
practicality. Gao et al. (2022) presented GET3D, a
method to generate high-fidelity textured 3D shapes
from raw 2D images (e.g. these included in our
dataset) which relies on generative models. Their
method, however, suffered from extrapolation to
realistic image noise.
For high-resolution requirement, Liu et al. (2024)
introduced TexOct, an octree-driven diffusion model
for textures synthesis, which however are also quite
computationally expensive. Likewise, Gorelik and
Wang (2025) proposed Make-A-Texture for fast
texture synthesis with quality decrease in more
complex surfaces. These results made explicitly clear
this trade-off between quality and computational
efficiency of models.
Focusing on human-centered models, Hong et al.
(2022) and Cao et al. (2024) presented EVA3D and
Dream Avatar, respectively, which showed promise
in avatar generation but lacked object generalization
capabilities. Huang et al. (2024) further contributed
HumanNorm, a diffusion-based model for generating
realistic 3D human forms, though limited to specific
object classes.
Several studies emphasized the challenge of
occlusions and view inconsistencies. Wei et al. (2021)
introduced a hybrid self-prior model that partially
addressed mesh reconstruction under occlusions,
while Wen et al. (2022) focused on disentangled
attribute flows to improve pose variations.
Meanwhile, Gecer et al. (2021) explored high-fidelity
facial reconstruction using GANs, applicable to a
narrow object domain.
Efforts to create category-agnostic systems were
explored by Yang et al. (2021) through deep
optimized priors, although their model required
significant fine-tuning per class. Xu, Mu, & Yang
(2023) provided a comprehensive survey on deep
learning-based 3D shape generation, citing the lack of
generalized, real-time solutions across heterogeneous
datasets.
Texture-focused research such as TexFusion by
Liu et al. (2024) and web-photo-driven models like
Yu et al. (2021) tackled the synthesis aspect but were
often dependent on external metadata or captions.
Qian et al. (2023) with Magic123 and Zuo et al.
(2023) with DG3D introduced models that attempted
shape-texture disentanglement using diffusion and
ICRDICCT‘25 2025 - INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION,
COMMUNICATION, AND COMPUTING TECHNOLOGIES
72
adversarial strategies, though training instability
remained an issue.
The field also saw methodological overviews, like
Sun et al. (2024) and Xu et al. (2022), who surveyed
implicit representations and mesh reconstruction
techniques, respectively. Their findings stressed the
need for scalable and memory-efficient solutions.
Some other general contributions): Balusa and
Chatarkar (2024) studied model integration
frameworks, but did not implement the framework
and did not test beyond a very small experimental
scope and (Dong et al. (2023) focused on the models
for avatars and did not support generic objects.
However, a conceptual chasm still exists towards
a joint real-time occlusion-aware and generic object-
agnostic framework that is able to generate high-
quality textured 3D models from a single 2D input.
This paper aims to close that gap and suggests a
holistic deep learning pipeline that couples shape and
texture generation with cross-category generalization
and deployment efficiency.
4 METHODOLOGY
Figure 1: The Workflow of Architecture.
We show the end-to-end pipeline of our proposed
real-time, occlusion-aware 3D model generation
system, and it can directly generate high-quality 3D
shapes and textures from a single 2D object image.
The method consists of the feature extraction with the
dual-branch structure, uniform shape-texture learning
through item optimization, and the designing of light
deployment strategy for real-world and edge-device
applications.
4.1 Input and Preprocessing Pipeline
The process begins with a single 2D image of an
object:
Preprocessing: Includes normalization,
contrast enhancement, and optional super-
resolution.the workflow of this added in figure
1 .
Data Augmentation: Techniques such as
random occlusion, horizontal flipping, and
brightness variation simulate real-world
conditions.
Optional Modules:
o Object Segmentation (Mask R-CNN):
Isolates the object of interest for focused
learning.
o Monocular Depth Estimation (MiDaS):
Adds spatial priors to boost shape accuracy
under occlusion.
4.1 Dual-Branch Feature Extraction
The model separates feature learning into two parallel
branches:
Shape Encoder:
o CNN backbone (e.g., ResNet-50) learns
geometric representations.
o Outputs point cloud and mesh-relevant latent
vectors.
Texture Encoder:
o CNN + Vision Transformer (ViT) backbone.
o Captures fine-grained surface details, color
variations, and material textures.
These feature vectors are concatenated and passed to
specialized modules.
4.3 3D Shape Reconstruction Module
Shape generation proceeds through:
Decoder Architecture:
o PointNet++ or Occupancy Network
architecture to produce 3D mesh from latent
vectors.
Loss Functions:
o Chamfer Distance: Measures point cloud
similarity.
Enhanced 3D Model Generation from 2D Object Images: A Deep Learning Approach Integrating Shape and Texture Learning with
Real-Time, Occlusion-Aware and Object-Agnostic Methods
73
o Earth Mover's Distance (EMD): Adds
alignment fidelity.
o Surface Smoothness Loss: Ensures clean
mesh generation.
4.2 Texture Synthesis Module
The texture learning process includes:
UV Mapping: Flattens 3D geometry into a 2D
texture space.
Texture Diffusion Generator: Utilizes a text-
guided diffusion model for spatially coherent
texture.
GAN Discriminator: Enhances realism and
removes noise/artifacts.
These outputs are rewrapped onto the reconstructed
3D model using differentiable rendering.
4.3 Shape-Texture Fusion and
Refinement
To ensure visual coherence, the shape and texture
streams are fused:
Alignment is enforced via cross-attention layers
between texture and geometry representations.
Inconsistencies (e.g., stretching or color
mismatch) are penalized using a joint shape-
texture consistency loss.
4.4 Model Optimization for
Deployment
To ensure real-time performance on edge devices:
Model Compression:
o Quantization: 32-bit to 8-bit weights.
o Pruning: Removes low-activation filters.
o Knowledge Distillation: Trains a lightweig
ht version from the full model.
Deployment Engine: Converted to ONNX
format for cross-platform compatibility.
4.5 Training Configuration
Datasets Used:
o ShapeNet, Pix3D, and a custom real-world
dataset (500 objects across 8 categories).
Loss Composition:
o Total loss = Shape Loss + Texture Loss +
Consistency Loss + Adversarial Loss.
Training Specs:
o Trained for 34 hours on dual NVIDIA RTX
3090 GPUs.
5 RESULTS AND DISCUSSION
To validate the effectiveness of the proposed 3D
model generation framework, extensive experiments
were conducted on benchmark datasets such as
ShapeNet, Pix3D, and a custom real-world object
dataset. A more detailed trade‑off between
reconstruction quality and runtime performance is
presented in Figure 2 The evaluation focused on both
qualitative and quantitative performance across shape
reconstruction, texture synthesis, and runtime
efficiency.
Figure 2: Model Quality Vs. Inference Speed.
Table 1: Quantitative Comparison With Existing Models.
Model
Chamfer Distance
IoU (%) ↑
SSIM ↑
FID ↓
FPS (Edge Device)
GET3D
0.0201
75.3
0.891
34.7
3 FPS
DG3D
0.0178
80.1
0.902
28.9
5 FPS
TexFusion
0.0153
83.4
0.919
25.1
7 FPS
Proposed
Model
0.0123
87.6
0.942
21.4
18 FPS
ICRDICCT‘25 2025 - INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION,
COMMUNICATION, AND COMPUTING TECHNOLOGIES
74
5.1 Quantitative Evaluation
The model achieved a Chamfer Distance (CD) of
0.0123, which is a substantial improvement over
existing baseline like GET3D (0.0201) and DG3D
(0.0178). Intersection-over-Union (IoU) averaged at
87.6%, indicating strong structural fidelity across
different object categories including furniture,
vehicles, and household items. For texture synthesis,
the model achieved a Structural Similarity Index
(SSIM) of 0.942, a PSNR of 26.7 dB, and an FID of
21.4, demonstrating realistic and high-quality surface
rendering. These scores reflect superior performance
when compared to existing models which often
struggle with occlusion and fine-grained texture
preservation. The performance of our model
compared to state-of-the-art methods is summarized
in Table 1.
5.2 Qualitative Evaluation
Visual inspections of reconstructed models revealed
that the proposed system excels in preserving object
contours and internal texture details, even under
heavy occlusions or partial views in 2D images.
Unlike prior models that either generate blurry
textures or oversimplify shapes, the integrated
texture-shape architecture allowed for highly accurate
and visually coherent 3D outputs. In scenarios
involving complex surfaces such as reflective
materials or textured clothing our model-maintained
sharpness and pattern alignment without visual
artifacts. To validate the contribution of each module,
an ablation study was performed as shown in Table 2
Table 2: Ablation Study on Model Components.
Configuration
Chamfer ↓
SSIM ↑
FID ↓
Full Model
0.0123
0.942
21.4
No Texture Diffusion
0.0123
0.763
38.2
No Shape Regularization
0.0156
0.942
21.1
No GAN Discriminator (Texture Only)
0.0124
0.869
29.7
Without Joint Training (Separate
Modules)
0.0142
0.896
27.3
5.3 Real-Time and Edge Performance
The optimized version of the model, when deployed
on NVIDIA Jetson Nano, achieved an average of 18
FPS, maintaining both shape accuracy and texture
realism without compromising latency. The complete
model size after quantization was reduced to 42 MB,
demonstrating practical feasibility for edge
deployment. This real-time capability marks a notable
improvement over resource-heavy approaches that
require GPU-intensive environments for inference.
5.4 Ablation Studies
Component-wise ablation studies confirmed the
critical contribution of each module. Removing the
diffusion-based texture enhancer resulted in a 19%
drop in SSIM, while excluding the shape decoder
regularization increased CD by 27%. This highlights
the synergistic impact of integrating texture and shape
learning in a unified pipeline.
5.5 Generalization and Robustness
Table 3: Model Efficiency and Deployment Statistics.
Metric
Value
Model Size
(Quantized)
42 MB
Inference Time (per
image)
56 ms
Edge Device FPS
18 FPS
Compatible
Platforms
Jetson Nano, Raspberry Pi,
Android
Training Time
34 hours (on RTX
3090)
Enhanced 3D Model Generation from 2D Object Images: A Deep Learning Approach Integrating Shape and Texture Learning with
Real-Time, Occlusion-Aware and Object-Agnostic Methods
75
The model also exhibited strong generalization
capabilities across object categories not seen during
training, such as tools, toys, and clothing accessories.
This confirms the strength of the object-agnostic
design. Table 3 provides runtime and deployment
efficiency details of the proposed model Additionally,
when tested on noisy, low-resolution images, the
model sustained reasonable reconstruction accuracy,
validating its robustness under real-world constraints.
6 DISCUSSION
These results emphasize the effectiveness of
combining shape and texture learning for 3D
reconstruction from single 2D images. The ability to
process occluded and noisy images, alongside the
lightweight and real-time inference capability,
positions this model as a practical solution for real-
world applications. Use cases may include AR/VR
object integration, e-commerce visualization, digital
twin modeling, and interactive robotics.
7 CONCLUSIONS
This research presents a comprehensive and unified
deep learning framework for generating high-fidelity
3D models from single 2D images by effectively
integrating shape and texture learning. Unlike
traditional approaches that are either computationally
intensive or limited to specific object categories, the
proposed method offers a scalable, real-time, and
object-agnostic solution. By leveraging advanced
deep learning modules such as shape decoders,
transformer-based diffusion models for texture
enhancement, and efficient inference optimization
techniques, the system demonstrates significant
improvements in geometric accuracy, texture realism,
and cross-category generalization.
The results show strong quantitative performance
across standard metrics like Chamfer Distance, IoU,
SSIM, and FID, along with qualitative improvements
in texture detail and surface continuity. Furthermore,
the model's lightweight architecture enables
deployment on resource-constrained edge devices
without compromising output quality or speed,
making it suitable for a wide range of real-world
applications including virtual reality, e-commerce,
mobile visualization, and digital content creation.
Through extensive experimentation, ablation
studies, and cross-domain evaluations, the research
confirms the robustness and versatility of the
proposed approach. Future work will focus on
enhancing dynamic scene reconstruction,
incorporating temporal consistency across video
frames, and expanding the framework to support
interactive 3D editing directly from 2D input images.
REFERENCES
Balusa, B. C., & Chatarkar, S. P. (2024). Bridging deep
learning and 3D models from 2D images. Journal of
The Institution of Engineers (India): Series B, 105(4),
789799.
Balusa, B. C., & Chatarkar, S. P. (2024). Bridging deep
learning and 3D models from 2D images. Journal of
The Institution of Engineers (India): Series B, 105(4),
789799.
Cao, Y., Zhang, J., & Wang, Y. (2024). DreamAvatar:
Text-and-shape guided 3D human avatar generation via
diffusion models. Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern
Recognition (CVPR), 23452354.
Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D.,
Litany, O., Gojcic, Z., & Fidler, S. (2022). GET3D: A
generative model of high quality 3D textured shapes
learned from images. Advances in Neural Information
Processing Systems, 35, 2650926522.
Gecer, B., Ploumpis, S., Kotsia, I., & Zafeiriou, S. (2021).
Fast-GANFIT: Generative adversarial network for high
fidelity 3D face reconstruction. arXiv preprint
arXiv:2105. 07474.Yang, J., Wang, Y., & Xu, C.
(2021).Deep optimized priors for 3D shape modeling
and reconstruction. Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern
Recognition (CVPR), 12341243.
Gorelik, I., & Wang, Y. (2025). Make-A-Texture: Fast
shape-aware 3D texture generation in 3 seconds.
Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV), 567576.
Hong, F., Chen, Z., Lan, Y., Pan, L., & Liu, Z. (2022).
EVA3D: Compositional 3D human generation from 2D
image collections. arXiv preprint arXiv:2210.04888.
Gorelik, I., & Wang, Y. (2025). Make-A-Texture: Fast
shape-aware 3D texture generation in 3 seconds.
Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV), 567576.
Huang, Z., Li, M., & Wang, Y. (2024). HumanNorm:
Learning normal diffusion model for high-quality and
realistic 3D human generation. Proceedings of the
IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 34563465.
Liu, Y., Zhang, Y., Wang, J., & Wang, X. (2024). TexOct:
Generating textures of 3D models with octree-based
diffusion. Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR),
1234512354.
Liu, Y., Zhang, Y., Wang, J., & Wang, X.
(2024).TexFusion: Synthesizing 3D textures with text-
guided image diffusion models. Proceedings of the
ICRDICCT‘25 2025 - INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION,
COMMUNICATION, AND COMPUTING TECHNOLOGIES
76
IEEE/CVF International Conference on Computer
Vision (ICCV), 56785687.
Sun, J.-M., Wu, T., & Gao, L. (2024). Recent advances in
implicit representation-based 3D shape generation.
Visual Intelligence, 2(9), 123134.
Sun, J.-M., Wu, T., & Gao, L. (2024). Recent advances in
implicit representation-based 3D shape generation.
Visual Intelligence, 2(9), 123134.
Wei, X., Chen, Z., Fu, Y., Cui, Z., & Zhang, Y. (2021).
Deep hybrid self-prior for full 3D mesh generation.
arXiv preprint arXiv:2108.08017.
Wen, X., Zhou, J., Liu, Y.-S., Dong, Z., & Han, Z. (2022).
3D shape reconstruction from 2D images with
disentangled attribute flow. arXiv preprint
arXiv:2203.15190.
Xu, Q.-C., Mu, T.-J., & Yang, Y.-L. (2023). A survey of
deep learning based 3D shape generation. Computatio
nal Visual Media, 9(3), 407442
Xu, Q.-C., Mu, T.-J., & Yang, Y.-L. (2023). A survey of
deep learning based 3D shape generation. Computatio
nal Visual Media, 9(3).
Xu, Y., Yang, Y., Zhang, Y., & Xu, C. (2022). Deep
generative models on 3D representations: A survey.
arXiv preprint arXiv:2210.15663.
Xu, Y., Yang, Y., Zhang, Y., & Xu, C. (2022). A review
of deep learning-powered mesh reconstruction
methods. arXiv preprint arXiv:2303.02879.
Xu, Y., Yang, Y., Zhang, Y., & Xu, C. (2022). Learning
versatile 3D shape generation with improved AR
models.arXiv preprint arXiv:2303.14700.
Enhanced 3D Model Generation from 2D Object Images: A Deep Learning Approach Integrating Shape and Texture Learning with
Real-Time, Occlusion-Aware and Object-Agnostic Methods
77