Stable Diffusion, Unstable Me: Text-to-image Generation

Text-to-image generation is not a new idea. Notably, the Generative Adversarial Network (GAN) architecture, a once-popular deep-learning computer vision algorithm, had already been used to generate birds and flowers from text.

After more improvements to generative image algorithms, like hyperrealistic generation of human faces and the GLIDE diffusion model, we now have a deluge of commercial text-to-image generators, such as Google’s Imagen, TikTok’s Greenscreen, and OpenAI’s DALL-E.

And here comes the new kid on the block, `Stable Diffusion`.

Before moving ahead, for those who want to understand the nitty-gritty of diffusion models, I strongly encourage you to go through this awesome blogpost by Lilian Weng, https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ and this tweet by Tom Goldstein.

The rest of the article is an exercise in using the Stable Diffusion model to generate images given “my name”. There is no real purpose for doing so, other than satisfying my curiosity.

Narcissistic as it sounds, I want to generate an image with my name.

Be honest, don’t we all google our own names from time to time? And thus, I went to https://huggingface.co/spaces/stabilityai/stable-diffusion and typed in Liling Tan.

Before I show the generated results, I would like to clarify some specifics about my first and last name. In general, Liling, an English romanization of my Chinese given name, is more likely to refer to a woman than to a man. Next, Tan (陳) is a common Southern Chinese surname whose romanization originates from the Min language; more commonly, you would expect the romanization from Mandarin Chinese, Chen. With K-pop being a global phenomenon, the top-ranked Google result for Chen points you to the Korean singer from the K-pop band EXO. Also, since Tan (my last name) has the same spelling as tan the color, the top Google search results end up pointing to “a yellowish-brown color”.

For reference, here are the control experiment results from googling “liling tan”:

And now, the results…

Okay, that’s definitely nowhere close to how I look.

I kind of expected the image to show an Asian female but the first image was kinda weird.

Now, what happens if I lowercase the prompt and re-run the generation?

Alright, the model seems hell-bent on some facial features and sorta generates one older and one younger version of “liling tan”.

Interestingly, the model has an internal mechanism that blocks “unsafe” content and outputs this error message:

This Image was not displayed because our detection model detected Unsafe content.

Then I got really curious: do the two versions of “liling tan” exist IRL (in real life)?

So I did the natural thing and ran a reverse image search with https://lens.google.com/

Hmmm, no results, let’s extend the frame beyond the face…

And of course, what was I expecting, surely Google Shopping will take the chance to advertise and sell me something -_-

Maybe the older version generated by the model is more grounded in someone IRL?

Image Search Result 3

Hmmm, no results, so the model kinda generated two unique people who don’t exist IRL?

But we see two separate dots indicating that two other results exist; let’s see the first one, at the bottom.

Of course, it will try to sell me something again… What was I expecting? @_@

Let’s try the result for the other dot. Now this is interesting: it’s trying to promote a piece inspired by Girl with a Pearl Earring (ca. 1665) by Johannes Vermeer.

But what about “Find image source”?

Does it really find the source of the images that the model uses to slice, dice and “diffuse”?

It’s hard to say:

Đợi tí (wait a minute), does that mean that we don’t know which images the model was trained on, or used to splice, before generating the results?

Now that we have seen the underbelly of the results, let’s backtrack to the OG paper listed in the original source code repository, https://github.com/CompVis/stable-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer. In CVPR '22.

According to the paper, the model is pre-trained on the LAION database https://laion.ai/blog/laion-5b/ and fine-tuned on the Conceptual Captions dataset https://ai.google.com/research/ConceptualCaptions/.

Our model is pretrained on the LAION [73] database and finetuned on the Conceptual Captions [74] dataset.

What if we built a reverse image search based on these datasets?

Perhaps if we can find the approximate nearest neighbor images to the outputs generated by the model, then it might be possible to “explain away” which images the model had deemed to be “salient” enough to diffuse and generate the outputs from my name.
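To make the nearest-neighbor idea concrete, here is a minimal sketch. It assumes every image (training images and the generated output alike) has already been turned into an embedding vector by some image encoder; the vectors, the image names and the helper functions below are all made up for illustration. It also does an exact brute-force scan with cosine similarity, whereas a real search over billions of images would need an approximate index instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical embeddings; real ones would come from an image encoder.
toy_index = {
    "training_image_a": [0.9, 0.1, 0.0],
    "training_image_b": [0.0, 1.0, 0.0],
    "training_image_c": [0.7, 0.7, 0.1],
}
generated = [1.0, 0.2, 0.0]  # stand-in for the embedding of a generated image

top = nearest_neighbors(generated, toy_index, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Even with such a search, a close neighbor would only suggest that a training image resembles the output, not prove that the model “used” it.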

Voila! Here’s a search engine to probe the dataset used to train Stable Diffusion, https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/

And here’s the list of results of searching for `Liling` in the LAION database.

Ah ha, that is one image that the model must have “chosen to diffuse”.

It is highly possible that the model somehow “focused” on an image in the training data with “Liling” in the caption and generated an image similar to that image.
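For the curious, a caption lookup of this kind can be sketched in a few lines. This is not how the search engine linked above is implemented; the records, URLs and the `search_captions` helper are hypothetical, and the match is a plain case-insensitive substring test.

```python
def search_captions(records, query):
    """Return records whose caption contains the query, ignoring case."""
    needle = query.casefold()
    return [r for r in records if needle in r["caption"].casefold()]


# Hypothetical (url, caption) pairs standing in for rows of an image-text dataset.
toy_records = [
    {"url": "https://example.com/1.jpg", "caption": "Portrait of Liling in a garden"},
    {"url": "https://example.com/2.jpg", "caption": "A tan leather sofa"},
    {"url": "https://example.com/3.jpg", "caption": "liling city skyline at night"},
]

hits = search_captions(toy_records, "Liling")
for hit in hits:
    print(hit["url"], "->", hit["caption"])
```

Searching the same toy records for “tan” only returns the sofa, which is the same name-versus-color ambiguity that shows up in the Google results.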

Why did you bother to “diffuse <your_name>”?

To conclude the post, this exercise is purely out of curiosity. I wanted to know what the model would generate. But this exercise also highlights some sort of bias when using generative models. While “Liling Tan” isn’t stable enough to generate anything close to me or any other top-ranked “Liling Tan” search result, the model seems to be more stable in generating famous people (e.g. John Oliver).

To end this article, here’s a result from generating images with my online handle alvations as the prompt…