
Generative adversarial networks for diverse and explainable text-to-image generation

PhD ceremony: Z. (Zhenxing) Zhang, MSc
When: February 07, 2023
Supervisor: prof. dr. L.R.B. (Lambert) Schomaker
Co-supervisor: S.H. (Hamidreza) Mohades Kasaei, PhD
Where: Academy building RUG
Faculty: Science and Engineering

This thesis focuses on algorithms for text-to-image generation, which aim to produce photo-realistic and semantically consistent pictures from a natural-language description. Chapter 1 provides a brief general introduction to research on image synthesis from linguistic (textual) descriptions.

In Chapter 2, we propose the Dual-Attention Generative Adversarial Network (DTGAN), which can produce perceptually plausible pictures from natural-language descriptions while employing only a single generator/discriminator pair.

Chapter 3 addresses the lack-of-diversity issue in current single-stage text-to-image generation models. To tackle this problem, we improve on DTGAN with an efficient and effective single-stage framework (DiverGAN) that yields diverse, photo-realistic and semantically related images from a single natural-language description combined with different latent vectors.

In Chapter 4, we construct novel ‘Good vs Bad’ data sets consisting of successful as well as unsuccessful synthesized samples of birds and of human faces. On these, dedicated classifiers are trained to verify that generated images are natural, realistic and believable.

In Chapters 5 and 6, we investigate the latent space and the linguistic space of a conditional text-to-image GAN model to improve the explainability of the generation process. More specifically, we explore the relationship between the latent control space and the resulting image variation by applying an independent-component analysis (ICA) algorithm to the pretrained weights of the generator. Furthermore, we qualitatively analyze the roles played by ‘linguistic’ embeddings in the semantic space of synthetic images, using linear and triangular interpolation between keywords.
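The latent-space analysis described above can be illustrated with a minimal sketch. This is not the thesis's actual pipeline: the weight matrix below is a random stand-in for a pretrained generator layer, and the dimensions and edit strength are hypothetical. The idea is to run ICA on the generator's weights to obtain independent directions in the latent space, then move a latent code along one direction to vary a single semantic factor of the generated image.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-in for the pretrained weight matrix of the generator's first
# fully connected layer (latent_dim -> hidden_dim); in the thesis this
# would be taken from the trained conditional GAN generator.
latent_dim, hidden_dim = 100, 512
W = rng.standard_normal((hidden_dim, latent_dim))

# Treat each output unit's weight vector as one observation and run ICA
# to recover statistically independent directions in latent space.
n_directions = 10
ica = FastICA(n_components=n_directions, random_state=0, max_iter=1000)
ica.fit(W)                      # shape (n_samples, n_features) = (512, 100)
directions = ica.components_    # (n_directions, latent_dim)

# Moving a latent code along one unit-norm direction should change one
# semantic factor of the synthesized image while leaving others intact.
z = rng.standard_normal(latent_dim)
alpha = 3.0                     # edit strength (hypothetical)
z_edited = z + alpha * directions[0] / np.linalg.norm(directions[0])
```

In practice one would decode both `z` and `z_edited` with the generator and compare the resulting images to interpret what the chosen component controls.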
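The keyword-interpolation experiments can likewise be sketched in a few lines. The helper names and the use of plain barycentric weights are assumptions for illustration, not the thesis's implementation: linear interpolation blends two keyword embeddings along a line, while triangular interpolation blends three embeddings with normalized weights, and each blended embedding would then condition the generator.

```python
import numpy as np

def linear_interp(e_a, e_b, t):
    """Blend two keyword embeddings; t=0 gives e_a, t=1 gives e_b."""
    return (1.0 - t) * e_a + t * e_b

def triangular_interp(e_a, e_b, e_c, weights):
    """Blend three keyword embeddings with barycentric weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one
    return w[0] * e_a + w[1] * e_b + w[2] * e_c
```

Feeding a sweep of such interpolated embeddings to the conditional generator shows how the synthesized images morph between the semantic concepts of the keywords.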