by Kerrick on 4/5/25, 3:51 AM with 364 comments
by x187463 on 4/8/25, 10:07 AM
On another note (and perhaps others are feeling similarly), I'm finding myself surprised at how little use I have for this stuff, LLMs included. If, ten years ago, you had told me I would have access to tools like this, I'm sure I would have responded with a never-ending stream of ideas and excitement. But now that they're here, I just sort of poke at it for a minute and carry on with my day.
Maybe it's the unreliability on all fronts, I don't know. I ask a lot of programming questions and appreciate some of the autocomplete in vscode, but I know I'm not anywhere close to taking full advantage of what these systems can do.
by card_zero on 4/8/25, 10:22 AM
* The weird-ass basket decoration on the table originally has some big chain links (maybe anchor chain, to keep the theme with the beach painting). By the third version, they're leathery and are merging with the basket.
* The candelabra light on the wall, with branch decorations, turns into a sort of skinny minimalist gold stag head, and then just a branch.
* The small table in the background gradually loses one of its three legs, and ends up defying gravity.
* The freaky green lamps in the window become at first more regular, then turn into topiary.
* Making the carpet less faded turns up the saturation on everything else, too, including the wood the table is made from.
by nowittyusername on 4/8/25, 4:04 PM
by probably_wrong on 4/8/25, 10:39 AM
I have to disagree with the conclusion. This was an important discussion to have two or three years ago; we had it online back then, and we more or less agreed that it's unfair for artists to have their works sucked up with no recourse.
What the post should say is "we know that this is unfair to artists, but the tech companies are making too much money from them and we have no way to force them to change".
by shubhamjain on 4/8/25, 11:32 AM
4o is the first image-generation model that feels genuinely useful, not just for pretty things. It can produce comics, app designs, UI mockups, storyboards, marketing assets, and so on. I saw someone make a multi-panel comic with it with consistent characters. Obviously, it's not perfect. But just getting 90% of the way there is a game changer.
by gcanyon on 4/8/25, 11:43 AM
As I've argued in the past, I think copyright should last maybe five years: in this modern era, monetizing your work doesn't (usually) have to take more than a short time. I'd happily concede to some sort of renewal process to extend that period, especially if some monetization method is in process. Or some sort of mechanical rights process to replace the "public domain" phase early on. Or something -- I haven't thought about it that deeply.
So thinking about that in this process: everyone is "ghiblifying" things. Studio Ghibli has been around for very nearly 40 years, and their "style" was well established over 35 years ago. To me, that (should) make(s) it fair game.
The underlying assumption, I think, is that all the "starving" artists are being ripped off, but are they? Let's consider the numbers -- there are a handful of large-scale artists whose work is obviously replicable: Ghibli, the Simpsons, Pixar, etc. None of them is going hungry because a machine model can render a prom pic in their style. Then you get the other 99.999% of artists, all of whose work went into the model. They will be hurt, but not specifically because their style has been ingested and people want to replicate their style.
Rather, they will be hurt because no one knows their style, nor cares about it; people just want to be able to say e.g. "Make a charcoal illustration of me in this photo, but make me sitting on a horse in the mountains."
It's very much like the arguments about piracy in the past: 99.99% of people were never going to pay an artist to create that charcoal sketch. The 0.01% who might are arguably causing harm to the artist(s) by not using them to create that thing, but the rest were never going to pay for it in the first place.
All to say it's complicated, and obviously things are changing dramatically, but it's difficult to make the argument that "artists need to be compensated for their work being used to train the model" without both a reasonable plan for how that might be done, and a better-supported argument for why.
by haswell on 4/8/25, 12:04 PM
Unfortunately I think the answer to this question is a resounding “no”.
The time for thoughtful shaping was a few years ago. It feels like we’re hurtling toward a future where instead we’ll be left picking up the pieces and assessing the damage.
These tools are impressive and will undoubtedly unlock new possibilities for existing artists and for people who are otherwise unable to create art.
But I think it’s going to be a rough ride, and whatever new equilibrium we reach will be the result of much turmoil.
Employment for artists won’t disappear, but certain segments of the market will just use AI because it’s faster, cheaper, and doesn’t require time-consuming iterations and communication of vision. The results will be “good enough” for many.
I say this as someone who has found these tools incredibly helpful for thinking. I have aphantasia, and my ability to visualize via AI is pretty remarkable. But I can’t bring myself to actually publish these visualizations. A growing number of blogs and YouTube channels don’t share these qualms and every time I encounter them in the wild I feel an “ick”. It’ll be interesting to see if more people develop this feeling.
by justinator on 4/8/25, 4:11 PM
https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_...
(nice URL btw)
The room, the door, the ceiling are all of a scale to fit many sizes of elephants.
by m4thfr34k on 4/8/25, 8:18 PM
by Retr0id on 4/8/25, 10:14 AM
"in multimodal image generation, images are created in the same way that LLMs create text, a token at a time"
Is there some way to visualise these "image tokens", in the same way I can view tokenized text?
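For a rough intuition, here's a toy sketch of how a VQ-style tokenizer turns an image into a grid of discrete IDs: each patch is mapped to its nearest entry in a codebook, and that grid of integers is the "text" the model actually sees. The codebook here is random, purely for illustration; OpenAI hasn't published 4o's actual tokenizer, and real systems (VQ-VAE, VQGAN) learn theirs from data.

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((32, 32, 3))       # stand-in for a real image
    patch = 8                             # 8x8 patches -> a 4x4 token grid
    codebook = rng.random((256, patch * patch * 3))  # 256 made-up "visual words"

    tokens = np.empty((32 // patch, 32 // patch), dtype=int)
    for i in range(tokens.shape[0]):
        for j in range(tokens.shape[1]):
            # flatten the patch and pick the nearest codebook entry
            p = image[i*patch:(i+1)*patch, j*patch:(j+1)*patch].ravel()
            tokens[i, j] = np.argmin(((codebook - p) ** 2).sum(axis=1))

    print(tokens)  # one integer ID per patch, emitted left-to-right like text

One way to visualise them would be to colour each patch of the source image by its token ID, which is roughly the decoder's job run in reverse.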
by NitpickLawyer on 4/8/25, 10:24 AM
I like to look at how far we've come since the early days of Stable Diffusion. It was fascinating to play with it back then, but it quickly became apparent that it was "generic" and not suited for "real work" because it lacked consistency, text capabilities, fingers! and so on... Looking at these results now, I'm amazed at the quality, consistency and ease of use. Gone are the days of doing alchemy on words and adding a bunch of "in the style of Rutkowski, golden hour, hd, 4k, pretty please ..." at the end of prompts.
by smusamashah on 4/8/25, 3:43 PM
I like the book, but there are quite a few scenes that are hard to visualize and make sense of. An image generator that can follow that language and detail will be amazing. Even more awesome would be if it stays consistent across follow-ups.
by orbital-decay on 4/8/25, 11:27 AM
It's "just" a much bigger and much better trained model. Which is a quality on its own, absolutely no doubt about that. Fundamentally the issue is still there though, just less prominent. Which kind of makes sense - imagine the prompt "not green", what even is that? It's likely slightly out of distribution and requires representing a more complex abstraction, so the accuracy will necessarily be worse than stating the range of colors directly. The result might be accurate, until the model is confused/misdirected by something else, and suddenly it's not.
I think in the end none of the architectural differences will matter beyond the scaling. What will matter a lot more is data diversity and training quality.
by ziofill on 4/8/25, 4:36 PM
by hansmayer on 4/8/25, 3:20 PM
by xnorswap on 4/8/25, 3:11 PM
Feed is in quotes because my feed seems to be 90% suggested posts.
by morkalork on 4/8/25, 1:29 PM
by Zr01 on 4/9/25, 11:17 AM
by lou1306 on 4/8/25, 4:52 PM
by eapriv on 4/8/25, 12:19 PM
by klik99 on 4/8/25, 5:07 PM
My understanding is it’s a meta-LLM approach, using multiple models and having them interact. I feel like it’s also evidence that OpenAI is not seriously pursuing AGI (just my opinion, I know there are some on here who would aggressively disagree), but rather market use cases. It feels like an acceptance that any given model, at least for now, has its own limitations but can become more useful in combination.
by qiqitori on 4/8/25, 11:45 AM
Gave it another chance now, explicitly calling out the numbers. Well, they are improved, but I'm not sure how useful this result is (the spacing between the numbers is a little off and there's still some curious counting going on). It kind of looks like the numbers were pasted in after the fact?
https://chatgpt.com/share/67f4fa33-70dc-8012-8e1e-2dea563d3d...
by cadamsdotcom on 4/8/25, 4:56 PM
Wonderful to be alive for these step changes in human capability.
by vunderba on 4/8/25, 2:54 PM
by rkharsan64 on 4/8/25, 11:10 AM
by roenxi on 4/8/25, 9:55 AM
Which isn't a small thing; humour is an advanced soft skill.
by swframe2 on 4/9/25, 5:12 AM
Basically, the user's image prompt is converted into several prompts that generate parts of the final image as layers, which are then combined. The layers remain available, so edits can cleanly update one section without affecting the others.
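A minimal sketch of that compositing step, assuming each sub-prompt yields an RGBA layer; the filenames are hypothetical, and this is not OpenAI's published pipeline, just the general idea:

    from PIL import Image

    # Hypothetical outputs of the per-layer prompts, back to front.
    # alpha_composite requires all layers to share one size.
    layers = ["background.png", "subject.png", "text_overlay.png"]

    canvas = Image.open(layers[0]).convert("RGBA")
    for name in layers[1:]:
        canvas = Image.alpha_composite(canvas, Image.open(name).convert("RGBA"))
    canvas.save("final.png")

To edit just the subject, you'd regenerate subject.png and re-composite; the background and text layers are reused untouched, which is what makes clean localised edits possible.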
by lupusreal on 4/8/25, 4:31 PM
To me, this kind of image generation isn't very interesting for creating final products, but is extremely useful for communicating design intent to other people when collaborating on large creative projects. Previously I used crude "ms paint" sketches for this, which was much more tedious and less effective.
by thrance on 4/8/25, 11:40 AM
by DonHopkins on 4/8/25, 4:51 PM
A: Your face is pressed up against the ceiling!
by mrconter11 on 4/9/25, 10:03 AM
by freeamz on 4/8/25, 11:33 AM
by Der_Einzige on 4/8/25, 3:34 PM
We get Stable Diffusion v1.5 and SDXL, and what does the community go do with it? Lmao, see civit.ai and its literal hundreds of thousands of NSFW LoRAs. The most popular model today on that website is the NSFW anime version of SDXL, called "Pony Diffusion" (I'm literally not making this up. A bunch of Bronies made this model!)
Imagine an open-source image generator that generates tokens autoregressively like this, at this quality, being released.
The world is simply not ready for the amount of horny stuff that is going to be produced (especially without consent). It appears that the male libido really is the reason for most bad things in the world. We are truly the "villains of history".
by NiloCK on 4/8/25, 12:27 PM
by d4rkp4ttern on 4/8/25, 11:30 AM
by globnomulous on 4/9/25, 5:00 PM
In other words, people who care about money and only money are pushing for these tools because they're convinced they'll reduce labor costs and somehow also improve the resulting product. Meanwhile, the engineers and creative professionals who have these tools foisted upon them by unimaginative business people continue to insist that the tools are a solution in search of a problem; that they're stochastic parrots and plagiarism automata that bypass all of the important parts of engineering and creativity; and that they make the absolutely, breathtakingly idiotic mistake of supposing it's possible to leap to a finished product without all the work and problem solving involved in getting there.
> The line between human and AI creation will continue to blur
This is utter nonsense, and hype-man prognosticators in the tech world like the author of the article turn out pretty much 100% of the time to be either grifters or saps who have fallen for the grifters' nonsense.
by 1970-01-01 on 4/8/25, 6:03 PM
by ge96 on 4/8/25, 2:57 PM