Improving Text-to-Image Generation with Multimodal Semantic Coherence in Adversarial Training

Research in text-to-image (T2I) generation has gained considerable momentum owing to the availability of increasingly powerful natural language processing (NLP) models and generative networks. The quality of the data representation learned by a generative model is a determining factor in its success. Self-supervised learning strengthens the generative capacity of these networks by exploiting the hidden structure of the data to provide supervisory signals. Contrastive learning (CL), a self-supervised technique, has been used in generative models to improve image-to-image and T2I tasks. Generative adversarial networks (GANs) are well established in T2I generation, but they suffer from training instability. In T2I models, the many-to-many mapping between images and text captions further aggravates this instability and places an additional constraint on the adversarial loss. Several T2I models in the literature have employed CL to stabilize GAN training and to improve the semantic consistency between generated images and their textual captions. However, most of these models rely on stacked architectures as a baseline and on attention computations to enforce image-text semantic consistency, a setup that becomes increasingly expensive as the resolution of the generated images grows. In this work, we employ CL in a single-stage GAN to help the generative model converge to a better-learned latent data representation. Comprehensive experiments on benchmark datasets show a marked improvement in the convergence rate of the model compared with similar state-of-the-art models.
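To make the role of the contrastive objective concrete, the sketch below shows one common way to enforce image-text semantic consistency: a symmetric InfoNCE loss over a batch of paired image and caption embeddings, where matched pairs are pulled together and mismatched pairs in the same batch serve as negatives. This is a minimal illustration under assumed inputs (pre-computed image and text embeddings, a temperature hyperparameter of 0.1, and the function name image_text_contrastive_loss are all our own choices for exposition), not the exact loss formulation used in the model.

```python
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss between image and caption embeddings.

    image_emb, text_emb: tensors of shape (batch, dim), where the i-th
    image and the i-th caption form a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, caption_j).
    logits = image_emb @ text_emb.t() / temperature

    # The matched caption for image i sits on the diagonal.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy usage: random tensors standing in for encoder outputs.
    imgs = torch.randn(8, 256)  # e.g. image features from the discriminator
    caps = torch.randn(8, 256)  # e.g. caption features from a text encoder
    print(image_text_contrastive_loss(imgs, caps).item())
```

In a single-stage GAN, such a term is typically added to the generator and/or discriminator objective alongside the adversarial loss, providing an additional supervisory signal without the stacked generators and attention maps used by earlier multi-stage T2I models.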