by adam_gyroscope on 3/16/24, 4:43 PM with 58 comments
by kelseyfrog on 3/16/24, 5:46 PM
Think about how we chunk words[1] and recognize them. We have whole-word (shape) recognition, morpheme recognition, and spelling (letter-by-letter chunking). Text models receive tokens (akin to morpheme chunks) and don't have access to the underlying letters (spelling data) unless that was part of their training. For the most part, individual letters, something I think we can agree are necessary for rendering text, are not accessible.
An appropriate analogy is an illiterate artist: someone who can hear chunks of words and recognize them verbally is asked to do their best job at painting text. They can piece together letter clusters based on inference, but they cannot spell.
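As a concrete illustration of the token/letter gap, here is a minimal sketch using OpenAI's tiktoken library (the example word is arbitrary, and the exact splits shown in the comments are illustrative and vary by tokenizer):

```python
# The model consumes integer token IDs, not letters. Unless spelling
# examples appeared in training, "what letters make up this token?"
# is simply not information it was ever given.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era tokenizer
ids = enc.encode("indefatigable")
print(ids)                                   # a handful of integer IDs
print([enc.decode([i]) for i in ids])        # e.g. ['ind', 'ef', 'atig', 'able']
                                             # (illustrative; actual chunks vary)
```

Note that the chunks line up with morpheme-ish units, not letters, which is exactly the spelling gap described above.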
by candrewlee14 on 3/16/24, 5:26 PM
Classic reality checks for noticing that you're dreaming:
- looking at your hands
- looking at clocks
- trying to read
It’s funny that diffusion models often make those exact same mistakes. There’s clearly a similar failure mode where both are drawing from a distribution and losing fine details. Has this been studied?
by disconcision on 3/16/24, 5:33 PM
text, like hands, belongs to a class of imagery satisfying two characteristics: 1) they are intricately structured, having many subcomponents with precise spatial interrelationships over a range of scales; there are a lot of ways to make things that are like text/hands except wrong. 2) the average person is intimately familiar with said structures, having spent thousands of hours looking at them while performing complex tasks involving a visuospatial feedback loop.
image generation models tend to have trouble with (1), but people only tend to notice it when paired with (2).
(1) can be improved by scale and more balanced training data; consider that a person's own hands are very frequently in their own field of view, but the photos they take only rarely feature hands as the focus. this creates a differential bias: people are far more familiar with hands than the training data's coverage of them would suggest.
as for (2), image models tend to generate all kinds of implausibilities that the average person doesn't notice. try generating a complex landscape and ask a geologist how it formed.
by db48x on 3/16/24, 4:59 PM
by raesene9 on 3/16/24, 6:28 PM
As a test I just tried ChatGPT with the prompt:
Hi ChatGPT can you give me a picture of a banner that says "Hacker news"
And the resultant image does indeed have that text on it. Where I've seen this approach fall down is where the text is long and/or complex, or the words are uncommon.
So while there's some way to go, things are definitely improving here.
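For anyone who wants to rerun this test programmatically rather than through the ChatGPT UI, a hedged sketch against OpenAI's images API (the model name and parameters are assumptions based on what was current at the time, not the commenter's exact setup):

```python
# Reproduce the banner test via the API. Requires OPENAI_API_KEY
# in the environment.
from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="dall-e-3",
    prompt='A banner that says "Hacker news"',
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # open the URL and check the rendered text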
by reader5000 on 3/16/24, 5:31 PM
This could be entirely wrong, however.
It would be interesting to see what would happen with a model trained on a dataset of nothing but text.
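A minimal sketch of what such a dataset might look like, using PIL to render random strings as (image, caption) pairs (the filenames, image size, and font are illustrative assumptions):

```python
# Generate a toy "nothing but text" dataset: each image is a rendered
# random string, and its caption is exactly that string.
import random
import string
from PIL import Image, ImageDraw, ImageFont

def make_sample(idx: int, size=(256, 64)) -> str:
    word = "".join(random.choices(string.ascii_uppercase, k=random.randint(3, 8)))
    img = Image.new("RGB", size, "white")
    font = ImageFont.truetype("DejaVuSans.ttf", 40)  # any locally available font
    ImageDraw.Draw(img).text((10, 10), word, font=font, fill="black")
    img.save(f"textonly_{idx:06d}.png")
    return word  # the caption

captions = [make_sample(i) for i in range(10_000)]
```

Training a small diffusion model on pairs like these would isolate the spelling question from everything else such a model normally has to learn.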
by swyx on 3/16/24, 5:36 PM
in other words... what have you actually tried? be specific.
by huevosabio on 3/16/24, 5:44 PM
by jvm___ on 3/16/24, 5:37 PM
Generating a cat that looks LIKE a cat is fine because there are differences between cats.
The problem is that you can't make something that merely looks LIKE the letter K; it needs to satisfy the rules of K, and can't just resemble a K while actually being some made-up character.
They're LIKE generators, and they have trouble with the bits that need to be exact.
by spiderxxxx on 3/16/24, 6:45 PM
My suggestion is to use image-to-image: start with the text of your son's name rendered over a Gaussian-noise background, and then mask out the parts you want to keep.
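A minimal sketch of the image-to-image half of that suggestion, using the Hugging Face diffusers library (the model ID, font, example name, and strength value are all illustrative assumptions; the masking step would be done afterwards in an inpainting tool):

```python
# Render the name over Gaussian noise, then let img2img stylize around it.
# Lower strength preserves the letterforms; higher strength stylizes more.
import numpy as np
import torch
from PIL import Image, ImageDraw, ImageFont
from diffusers import StableDiffusionImg2ImgPipeline

noise = np.random.normal(127, 40, (512, 512, 3)).clip(0, 255).astype("uint8")
init = Image.fromarray(noise)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 120)  # any available font
ImageDraw.Draw(init).text((40, 190), "OLIVER", font=font, fill="white")  # placeholder name

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
out = pipe(
    prompt="colorful mural of a child's name painted on a brick wall",
    image=init,
    strength=0.55,  # the key trade-off knob
).images[0]
out.save("name_mural.png")
```

Because the letters exist in the init image before the model runs, it only has to decorate them, not spell them.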
by Frummy on 3/16/24, 5:42 PM
by Zetobal on 3/16/24, 6:15 PM
Newer models like Stable Cascade or SD 3 are using multimodal LLMs to caption images, including any text in them. DALL-E was at the forefront because they had access to GPT-4 Vision before everyone else. You will see that all new models will be able to spell. The problems we see are still mostly because of GIGO (garbage in, garbage out).
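For context, a hedged sketch of what that recaptioning step looks like in practice, using the GPT-4 Vision API available around the time of this thread (the prompt wording and image URL are placeholders):

```python
# Ask a multimodal LLM to caption a training image and transcribe any
# visible text verbatim, so the text/pixel pairing makes it into the
# training captions.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Caption this image. Transcribe any visible text verbatim."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/train_0001.png"}},  # placeholder
        ],
    }],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```

Captions that actually contain the on-image text are what let the next generation of models learn to spell; captions that omit it are the "garbage in" half of GIGO.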
by bxguff on 3/16/24, 6:41 PM
by d-z-m on 3/16/24, 6:14 PM
by sfmz on 3/16/24, 5:39 PM
https://medium.com/community-driven-ai/midjourney-can-spell-...
by nextaccountic on 3/16/24, 7:14 PM
by MatthiasPortzel on 3/16/24, 5:05 PM
There’s an anecdote about blind men whose sight was restored. They were adult men, who had felt cubes and heard about cubes, and could describe a cube. After their sight was restored, they were shown a cube and a sphere and were asked to identify them by sight. They were unable to, having never seen these objects before.
Many people (including very smart people) make the mistake of equating all forms of intelligence. They assume that computer programs have an intelligence level, and should be able to handle all tasks below that intelligence level, but machine learning models break our intuition here. A model which has been trained on stock market data and is extremely intelligent in this area may be able to predict the stock market tomorrow. But if it has not been trained on words, then it is no more able to write a sentence than a newborn baby. ChatGPT can eloquently generate words but it is completely unable to generate or understand pictures. (Ask ChatGPT to generate some ASCII art.) Eventually OpenAI will create a sophisticated multi-modal model capable of generating poems or reading words in an image or predicting the stock market, but this model will be completely unable to answer questions about the physical world, because it's only been trained on words and images.
by Wowfunhappy on 3/16/24, 6:17 PM
If an AI that could draw were also able to write, that would be artificial general intelligence. And pretty much everyone seems to agree we don't have that yet.