from Hacker News

Ask HN: Why we need complex text-to-image networks if this simple one works?

by freakynit on 6/28/24, 5:36 AM with 4 comments

Made Sonnet 3.5 write a simple text-to-image program. Trained it on the MNIST dataset for 50 epochs. Training took only about 20 minutes on my M1 Mac with 8GB RAM.

It was able to produce very good images based on the training data. And it's such a simple network.

My question is: why is all that extra complexity needed in today's text-to-image models based on transformers? Wouldn't scaling this out work equally well?

Code: https://gist.github.com/freakynit/1118403ad80448ee0313ba6c879f8688

Generated image: https://imgur.com/LCHDBhI
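The gist itself isn't reproduced here, but a minimal sketch of the kind of network the post describes — a label-conditioned MLP decoder mapping a digit class straight to 28x28 pixels — might look like the following. All names are my own, and the weights here are random; in the real program they would be learned from MNIST:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny label-conditioned decoder:
# 10-dim one-hot label -> 128-dim hidden layer -> 784 pixels (28x28).
W1 = rng.normal(0, 0.1, size=(10, 128))   # label -> hidden
W2 = rng.normal(0, 0.1, size=(128, 784))  # hidden -> pixels

def generate(digit: int) -> np.ndarray:
    """Map a digit label to a 28x28 image with sigmoid pixel intensities."""
    one_hot = np.zeros(10)
    one_hot[digit] = 1.0
    hidden = np.tanh(one_hot @ W1)
    pixels = 1.0 / (1.0 + np.exp(-(hidden @ W2)))  # each pixel in (0, 1)
    return pixels.reshape(28, 28)

img = generate(9)
print(img.shape)  # (28, 28)
```

A network this size trains in minutes on a laptop, which is consistent with the 20-minute figure in the post.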

  • by bjourne on 6/28/24, 1:09 PM

    But the images your network generates look nothing like MNIST digits. There is also no variance. For example, all 9s are identical.
  • by p1esk on 6/28/24, 6:03 PM

    For MNIST your model is sufficient. But it’s not structurally complex enough to generate more complex images, even if you scale it up.
  • by Am4TIfIsER0ppos on 6/28/24, 9:40 AM

    Have you employed a dozen ethicists to go through the input and output to make sure it can't say any slurs? [EDIT] That's why small ones don't exist