Text-to-Image: How AI Illustrates the War in Ukraine and What You Need to Generate Your Own Art

Hi! My name is Oleksii Avilov. I work as an ML engineer at the Ukrainian startup ZibraAI, where I do research, automation, and ML solution development. I mainly work on image-to-3D generation tasks, and now also on text-to-image.

Many 3D artists and illustrators believe that artificial intelligence will eventually replace them. I look at this challenge differently. At ZibraAI, we develop solutions to simplify the work of game designers using artificial neural networks. We don’t want to replace people, but strive to help them by making the most time-consuming and boring processes faster.

For the past few years, I have been fascinated by generative art – the art generated with the help of artificial intelligence.

At the beginning of last year, we worked on 3D model generation and tested different approaches, including text-to-image generation. It started rather as a hobby, and it probably would have remained a hobby, if not for the war.

In this post, I want to cover different approaches to generating images using artificial intelligence and share the story of how we used generative art to draw attention to the war in Ukraine.

The dawn of generating images from text

The first attempts to generate images from text began in the mid-2010s, with the appearance of Generative Adversarial Networks (GANs).

A Generative Adversarial Network is a system of two artificial neural networks that compete with each other. One network (the generator) generates images based on textual descriptions, while the other (the discriminator) evaluates them.

During the training phase, the goal of the generator is to trick the discriminator by creating a synthesized image as similar to the real one as possible. The task of the discriminator is to accurately distinguish real images from synthesized ones.
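The competition described above can be made concrete with the binary cross-entropy objectives commonly used for GANs. This is only a minimal sketch, not a trainable model: the discriminator scores are made-up numbers, the function names are hypothetical, and the text conditioning is left out to keep the example short.

```python
import math

def discriminator_loss(real_scores, fake_scores):
    """Binary cross-entropy: real images should score near 1, synthesized ones near 0."""
    real_term = sum(-math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_loss(fake_scores):
    """Non-saturating generator loss: small when the discriminator is fooled."""
    return sum(-math.log(s) for s in fake_scores) / len(fake_scores)

# Hypothetical discriminator outputs: the probability that an image is real.
real_scores = [0.9, 0.8]
fooled = [0.7, 0.6]   # the discriminator mostly believes the synthesized images
caught = [0.1, 0.2]   # the discriminator spots the synthesized images

# The generator is better off when it fools the discriminator...
print(generator_loss(fooled) < generator_loss(caught))
# ...and the discriminator is better off when it catches the fakes.
print(discriminator_loss(real_scores, caught) < discriminator_loss(real_scores, fooled))
```

Training alternates between lowering these two losses, which is what makes the networks "compete".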

Here is an example of such a generation created in 2016:

It was a breakthrough then, although it certainly looks pretty bad now.

Later, several alternative algorithms for generating images from text queries appeared, but none of them visibly improved on the quality of GANs. A good overview of image generation from its early days to today can be found here on GitHub.

The next stage of development – DALL·E and CLIP

Major changes in the text-to-image field came only at the beginning of last year. OpenAI introduced two solutions that, in my opinion, started the revolution in image generation we observe now: the DALL·E and CLIP neural networks.

The DALL·E neural network is based on GPT-3, the third generation of OpenAI's language models. It has a transformer architecture that extends text sequences with special image tokens, which are then transformed into images by another model (the decoder).
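To show the data flow only, here is a toy sketch of "extend the text sequence with image tokens, then decode them". Everything in it is an assumption made for illustration: the tiny vocabulary, the token offset, the random stand-in for the transformer, and the four-value codebook stand-in for the decoder. The real model is vastly larger and learned from data.

```python
import random

TEXT_VOCAB = {"an": 0, "avocado": 1, "chair": 2}   # hypothetical tiny vocabulary
IMAGE_TOKEN_OFFSET = 100    # image tokens get ids above the text vocabulary
CODEBOOK = [0, 85, 170, 255]  # each image token maps to one gray value
NUM_IMAGE_TOKENS = 4        # a 2x2 grid of image tokens

def next_image_token(sequence, rng):
    """Stand-in for the transformer: picks the next image token at random."""
    return IMAGE_TOKEN_OFFSET + rng.randrange(len(CODEBOOK))

def extend_with_image_tokens(text_tokens, rng):
    """Append image tokens one by one to the end of the text sequence."""
    sequence = list(text_tokens)
    for _ in range(NUM_IMAGE_TOKENS):
        sequence.append(next_image_token(sequence, rng))
    return sequence

def decode(image_tokens):
    """Stand-in for the decoder: look up each token and lay out a 2x2 image."""
    values = [CODEBOOK[token - IMAGE_TOKEN_OFFSET] for token in image_tokens]
    return [values[0:2], values[2:4]]

rng = random.Random(0)
text_tokens = [TEXT_VOCAB[word] for word in ["an", "avocado", "chair"]]
sequence = extend_with_image_tokens(text_tokens, rng)
image = decode(sequence[len(text_tokens):])
print(sequence)
print(image)
```

In the real system, the choice of each next token depends on the text and on the image tokens generated so far, which is how the prompt shapes the picture.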

Compared to previous solutions, DALL·E showed great progress in the quality of synthesized images. The network’s level of generalization deserves special attention: thanks to it, DALL·E can generate samples that it hasn’t seen during training (examples that are missing from the training dataset).

For example, these avocado-shaped chairs have become OpenAI's signature.

Source: https://openai.com/blog/dall-e/

CLIP got less hype upon release, but I believe it made a more significant contribution to the development of text-to-image generation. CLIP consists of two encoders, one for text and one for images, and it is very good at linking images to text.
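The two-encoder idea can be sketched as follows: both encoders map their input into the same vector space, and an image matches a text when the two vectors point in a similar direction. The embeddings below are hypothetical numbers standing in for real encoder outputs, so treat this as an illustration of the scoring step rather than of CLIP itself.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by the angle between them (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical output of the image encoder for one picture.
image_embedding = [0.9, 0.1, 0.3]

# Hypothetical outputs of the text encoder for two candidate captions.
text_embeddings = {
    "an armchair in the shape of an avocado": [0.8, 0.2, 0.4],
    "a city skyline at night": [-0.5, 0.9, 0.1],
}

scores = {
    caption: cosine_similarity(image_embedding, embedding)
    for caption, embedding in text_embeddings.items()
}
best_caption = max(scores, key=scores.get)
print(best_caption)
```

The same score can be read in the other direction: given one text, it tells you which of several images fits best, which is what makes it useful for steering a generator.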

Unlike with DALL·E, the developers made the trained CLIP weights freely available. After that, many CLIP-based text-to-image solutions (like BigGAN+CLIP, VQGAN+CLIP, CLIP Guided Diffusion, and others) were introduced.

And since then, something crazy has been going on in this industry.

There’s an ongoing competition between open source solutions from the community of independent developers (like Disco Diffusion, Latent Diffusion, Stable Diffusion) and commercial models from enterprises large and small (DALL·E 2 from OpenAI, Imagen from Google Research, Midjourney, and others). More and more generative content appears in the world.

Midjourney. Source

DALL·E 2 generations:

Image sources in the carousel: [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12].

Imagen generations:

Image sources in the carousel: [1], [2], [3], [4], [5], [6].

Stable Diffusion generations:

Image sources in the carousel: [1], [2], [3], [4], [5], [6], [7], [8], [9].

Here are some cool GIFs generated with Stable Diffusion:

Although big corporations usually have access to more computing power, many generative artists prefer open source solutions. The advantage of such solutions is that everyone can try them out and participate in their development, so the technologies themselves evolve quickly.

Moreover, the choice between open source products and commercial solutions often comes down to censorship. For example, OpenAI prohibits generating images on sensitive topics and bans users after several attempts to generate them. The word "Ukraine" is also prohibited in text prompts.

Censorship in DALL·E 2

Generation of images about the war in Ukraine

In the first week of the full-scale war, our team decided to use our expertise to help raise funds to rebuild cities destroyed by Russia and to remind the world that the war was not over. That’s how the Sirens Gallery project was born. When getting started, we conducted research and chose an approach based on Disco Diffusion.

At that time, version 4.1 was publicly available; version 5.6 has since been released. Disco Diffusion is based on the class-conditional diffusion model from OpenAI combined with CLIP, which links text prompts with images (you can play with the artificial neural network here).
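The role CLIP plays here can be illustrated with a toy guidance loop: at every step, the image is nudged in the direction that raises its similarity to the prompt. This is a heavily simplified sketch and not Disco Diffusion's code. The "image" is a four-number vector, the image encoder is assumed to be the identity, the denoising network is left out entirely, and all names and values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_gradient(image, text):
    """Gradient of cosine_similarity(image, text) with respect to the image."""
    norm_image = math.sqrt(sum(x * x for x in image))
    norm_text = math.sqrt(sum(y * y for y in text))
    cos = cosine_similarity(image, text)
    return [
        t / (norm_image * norm_text) - cos * x / norm_image ** 2
        for x, t in zip(image, text)
    ]

def guide(image, text, steps=50, scale=0.5):
    """Repeatedly nudge the image toward a higher similarity with the text."""
    for _ in range(steps):
        gradient = similarity_gradient(image, text)
        image = [x + scale * g for x, g in zip(image, gradient)]
    return image

text_embedding = [1.0, 0.0, 0.0, 0.0]   # hypothetical prompt embedding
start_image = [0.2, -1.0, 0.5, 0.3]     # hypothetical starting "image"
guided_image = guide(start_image, text_embedding)

print(cosine_similarity(start_image, text_embedding))
print(cosine_similarity(guided_image, text_embedding))
```

In a real guided diffusion run, a nudge like this is applied alongside each denoising step, and the guidance scale controls how strongly the prompt pulls the picture.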

Day 1 — Ghost of Kyiv

We explored the parameters, available models, and art styles and started generating pictures (you can find a parameters guide here, and a Disco Diffusion styles guide here). To get better image quality, we added a super-resolution model to the pipeline. We also built a convenient interface for internal needs: now everything runs through a Telegram bot.
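The shape of such a pipeline, generation followed by upscaling, can be sketched in a few lines. Both stages below are stand-ins and not what we actually run: the generator returns a prompt-seeded grid of gray values instead of calling a diffusion model, and plain nearest-neighbor upscaling takes the place of a learned super-resolution model. All function names are hypothetical.

```python
import random

def generate_image(prompt, size=4):
    """Stand-in for the text-to-image model: a prompt-seeded grid of gray values."""
    rng = random.Random(prompt)
    return [[rng.randrange(256) for _ in range(size)] for _ in range(size)]

def upscale_nearest(image, factor=2):
    """Stand-in for super-resolution: repeat every pixel `factor` times each way."""
    upscaled = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        upscaled.extend(list(wide_row) for _ in range(factor))
    return upscaled

def run_pipeline(prompt, factor=2):
    """Text prompt in, upscaled image out."""
    return upscale_nearest(generate_image(prompt), factor)

picture = run_pipeline("Ghost of Kyiv")
print(len(picture), len(picture[0]))
```

Keeping the stages as separate functions makes it easy to swap one of them, for example a newer generator, without touching the rest.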

But in fact, the technical part was not the hardest one. The biggest challenge was to make the war timeline.

In order to choose the most important events, and write stories about them, we had to review a lot of events and photos. You can't help but take it to heart when you read about yet another atrocity of Russian soldiers, another story of raped women and children, or look at photos from Bucha or Izyum, where the streets are littered with the corpses of civilians.

In total, Sirens Gallery consists of more than 2,000 pictures generated by artificial intelligence based on textual descriptions of the most important war events. 1,991 paintings have been offered for sale as NFTs on the Opensea.io and Paras.id platforms. So far, we have sold some of the paintings for a total of 250,000 UAH (~$7,000). We transferred these funds to three charity projects on the Dobro.ua platform (the report is available here).

Our goal was to draw the world community's attention to the horrors that Russia (the terrorist state) commits in Ukraine. At the same time, we wanted to show the courage, bravery, and humanity of Ukrainians and raise funds to help the victims of the war.

Day 85 — Heroes of Azovstal hold defense of Mariupol for 85 days

Ongoing technology development

Technologies evolve constantly. Some solutions provide more realistic generations. Since we started working on Sirens Gallery, Stable Diffusion, Imagen, Midjourney, and DALL·E 2 have been released.

A few weeks ago, the developers released the Stable Diffusion weights. Thanks to this, a lot of new solutions began to appear. For example, there is a Colab notebook with an interface familiar to Disco Diffusion users, where you can generate your own images and 2D and 3D animations from either text or an image.

The Stable Diffusion team has also created an API. Based on it, some Photoshop, GIMP, and Blender plugins have already appeared. There is an option with texture generation for Blender and even a solution for Krita, the open-source digital painting application. You can learn more about these plugins here and here.

You can look through a large database of Stable Diffusion generations. There’s also a guide for newbies on how to get started with Stable Diffusion on your PC.
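If you would rather start from code, one common route is the Hugging Face diffusers library. The sketch below is an assumption on my part and not necessarily what the guide above describes: it assumes `torch` and `diffusers` are installed, a CUDA GPU is available, and the `CompVis/stable-diffusion-v1-4` weights can be downloaded from the Hugging Face Hub. The `build_prompt` helper and its default style string are hypothetical.

```python
def build_prompt(subject, style="oil painting, dramatic lighting"):
    """Combine a subject and style modifiers into a single text prompt."""
    return f"{subject}, {style}"

def generate(prompt, output_path="output.png", seed=42):
    """Generate one image with Stable Diffusion and save it to disk.

    Assumes torch and diffusers are installed and a CUDA GPU is available.
    The imports are local so the rest of the script works without them.
    """
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")
    # A fixed seed makes the result reproducible for the same prompt.
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(
        prompt,
        num_inference_steps=50,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    image.save(output_path)
    return output_path

prompt = build_prompt("a sunflower field under a stormy sky")
print(prompt)
# generate(prompt)  # uncomment on a machine with a GPU and the libraries installed
```

The two parameters worth experimenting with first are `num_inference_steps` (more steps take longer but usually give cleaner results) and `guidance_scale` (higher values follow the prompt more literally).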

At Sirens Gallery, we haven’t counted on the technology being perfect; our project is rather about what you can achieve using new technologies. You can find our stories and the corresponding AI-generated illustrations on our social networks: Instagram and Twitter. They are also available on the Sirens Gallery project website.

Day 4 — Russian helicopters get destroyed in Chornobayivka

Recently, together with Save Ukraine, we held the first offline exhibition of AI-generated pictures, which depict the stories of saving Ukrainian children from the war. In the following months, the paintings will be presented on three continents. Afterward, they will be sold in New York. All funds will be transferred to help Ukrainians who became victims of Russian aggression.