by adam_gyroscope on 3/16/24, 4:43 PM with 58 comments
by kelseyfrog on 3/16/24, 5:46 PM
Think about how we chunk words[1] and recognize them. We have whole-word (shape) recognition, morpheme recognition, and spelling (letter-by-letter chunking). Text models receive tokens (akin to morpheme chunks) and don't have access to the underlying letters (spelling data) unless that was part of their training. For the most part, individual letters, something I think we can agree are necessary for rendering text, are not accessible.
An appropriate analogy is an illiterate artist: someone who can hear chunks of words and recognize them verbally is asked to do their best job at painting text. They can piece together letter clusters based on inference, but they cannot spell.
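As a concrete illustration of the token/letter gap, here is a minimal sketch using OpenAI's tiktoken library (the example word is arbitrary, and the exact splits shown in the comments are illustrative and vary by tokenizer):

```python
# The model consumes integer token IDs, not letters. Unless spelling
# examples appeared in training, "what letters make up this token?"
# is simply not information it was ever given.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era tokenizer
ids = enc.encode("indefatigable")
print(ids)                                   # a handful of integer IDs
print([enc.decode([i]) for i in ids])        # e.g. ['ind', 'ef', 'atig', 'able']
                                             # (illustrative; actual chunks vary)
```

Note that the chunks line up with morpheme-ish units, not letters, which is exactly the spelling gap described above.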
by candrewlee14 on 3/16/24, 5:26 PM
Classic reality checks for noticing that you're dreaming:
- looking at your hands
- looking at clocks
- trying to read
It’s funny that diffusion models often make those exact same mistakes. There’s clearly a similar failure mode where both are drawing from a distribution and losing fine details. Has this been studied?
by disconcision on 3/16/24, 5:33 PM
text, like hands, belongs to a class of imagery satisfying two characteristics: 1) they are intricately structured, having many subcomponents with precise spatial interrelationships over a range of scales; there are a lot of ways to make things that are like text/hands except wrong. 2) the average person is intimately familiar with said structures, having spent thousands of hours looking at them while performing complex tasks involving a visuospatial feedback loop.
image generation models tend to have trouble with (1), but people only tend to notice it when paired with (2).
(1) can be improved by scale and more balanced training data; consider that a person's own hands are very frequently in their own field of view, but the photos they take only rarely feature hands as the focus. this creates a differential bias: people are far more familiar with hands than the training data's coverage of them would suggest.
as for (2), image models tend to generate all kinds of implausibilities that the average person doesn't notice. try generating a complex landscape and ask a geologist how it formed.
by db48x on 3/16/24, 4:59 PM
by raesene9 on 3/16/24, 6:28 PM
As a test I just tried ChatGPT with the prompt:
Hi ChatGPT can you give me a picture of a banner that says "Hacker news"
And the resultant image does indeed have that text on it. Where I've seen this approach fall down is where the text is long and/or complex, or the words are uncommon.
So while there's some way to go, things are definitely improving here.
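For anyone who wants to rerun this test programmatically rather than through the ChatGPT UI, a hedged sketch against OpenAI's images API (the model name and parameters are assumptions based on what was current at the time, not the commenter's exact setup):

```python
# Reproduce the banner test via the API. Requires OPENAI_API_KEY
# in the environment.
from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="dall-e-3",
    prompt='A banner that says "Hacker news"',
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # open the URL and check the rendered text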
by reader5000 on 3/16/24, 5:31 PM
This could be entirely wrong, however.
It would be interesting to see what would happen with a model trained on a dataset of nothing but text.
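A minimal sketch of what such a dataset might look like, using PIL to render random strings as (image, caption) pairs (the filenames, image size, and font are illustrative assumptions):

```python
# Generate a toy "nothing but text" dataset: each image is a rendered
# random string, and its caption is exactly that string.
import random
import string
from PIL import Image, ImageDraw, ImageFont

def make_sample(idx: int, size=(256, 64)) -> str:
    word = "".join(random.choices(string.ascii_uppercase, k=random.randint(3, 8)))
    img = Image.new("RGB", size, "white")
    font = ImageFont.truetype("DejaVuSans.ttf", 40)  # any locally available font
    ImageDraw.Draw(img).text((10, 10), word, font=font, fill="black")
    img.save(f"textonly_{idx:06d}.png")
    return word  # the caption

captions = [make_sample(i) for i in range(10_000)]
```

Training a small diffusion model on pairs like these would isolate the spelling question from everything else such a model normally has to learn.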
by swyx on 3/16/24, 5:36 PM
in other words... what have you actually tried? be specific.
by huevosabio on 3/16/24, 5:44 PM
by jvm___ on 3/16/24, 5:37 PM
Generating a cat that looks LIKE a cat is fine because there are differences between cats.
The problem is that you can't make something that merely looks LIKE the letter K; it needs to satisfy the rules of K, and can't just resemble a K while actually being some made-up character.
They're LIKE generators, and they have trouble with the bits that need to be exact.
by spiderxxxx on 3/16/24, 6:45 PM
My suggestion is to use image-to-image: start with the text of your son's name rendered over a Gaussian-noise background, and then mask out the parts you want to keep.
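A minimal sketch of the image-to-image half of that suggestion, using the Hugging Face diffusers library (the model ID, font, example name, and strength value are all illustrative assumptions; the masking step would be done afterwards in an inpainting tool):

```python
# Render the name over Gaussian noise, then let img2img stylize around it.
# Lower strength preserves the letterforms; higher strength stylizes more.
import numpy as np
import torch
from PIL import Image, ImageDraw, ImageFont
from diffusers import StableDiffusionImg2ImgPipeline

noise = np.random.normal(127, 40, (512, 512, 3)).clip(0, 255).astype("uint8")
init = Image.fromarray(noise)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 120)  # any available font
ImageDraw.Draw(init).text((40, 190), "OLIVER", font=font, fill="white")  # placeholder name

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
out = pipe(
    prompt="colorful mural of a child's name painted on a brick wall",
    image=init,
    strength=0.55,  # the key trade-off knob
).images[0]
out.save("name_mural.png")
```

Because the letters exist in the init image before the model runs, it only has to decorate them, not spell them.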
by Frummy on 3/16/24, 5:42 PM
by Zetobal on 3/16/24, 6:15 PM
Newer models like Stable Cascade or SD 3 are using multimodal LLMs to caption images, including any text in them. DALL-E was at the forefront because they had access to GPT-4 Vision before everyone else. You will see that all new models will be able to spell. The problems we see are still mostly because of GIGO (garbage in, garbage out).
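For context, a hedged sketch of what that recaptioning step looks like in practice, using the GPT-4 Vision API available around the time of this thread (the prompt wording and image URL are placeholders):

```python
# Ask a multimodal LLM to caption a training image and transcribe any
# visible text verbatim, so the text/pixel pairing makes it into the
# training captions.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Caption this image. Transcribe any visible text verbatim."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/train_0001.png"}},  # placeholder
        ],
    }],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```

Captions that actually contain the on-image text are what let the next generation of models learn to spell; captions that omit it are the "garbage in" half of GIGO.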
by bxguff on 3/16/24, 6:41 PM
by d-z-m on 3/16/24, 6:14 PM
by sfmz on 3/16/24, 5:39 PM
https://medium.com/community-driven-ai/midjourney-can-spell-...
by nextaccountic on 3/16/24, 7:14 PM
by MatthiasPortzel on 3/16/24, 5:05 PM
There’s an anecdote about blind men whose sight was restored. They were adult men, who had felt cubes and heard about cubes, and could describe a cube. After their sight was restored, they were shown a cube and a sphere and were asked to identify them by sight. They were unable to, having never seen these objects before.
Many people (including very smart people) make the mistake of equating all forms of intelligence. They assume that computer programs have an intelligence level, and should be able to handle all tasks below that intelligence level, but machine learning models break our intuition here. A model which has been trained on stock market data and is extremely intelligent in this area may be able to predict the stock market tomorrow. But if it has not been trained on words, then it is no more able to write a sentence than a newborn baby. ChatGPT can eloquently generate words but it is completely unable to generate or understand pictures. (Ask ChatGPT to generate some ASCII art.) Eventually OpenAI will create a sophisticated multi-modal model capable of generating poems or reading words in an image or predicting the stock market, but this model will be completely unable to answer questions about the physical world, because it's only been trained on words and images.
by Wowfunhappy on 3/16/24, 6:17 PM
If an AI that could draw were also able to write, that would be artificial general intelligence. And pretty much everyone seems to agree we don't have that yet.