by meetpateltech on 5/7/25, 4:06 PM with 100 comments
by vunderba on 5/7/25, 7:21 PM
https://genai-showdown.specr.net
I don't know how much of Google's original Imagen 3.0 is incorporated into this new model, but the overall aesthetic quality unfortunately seems significantly worse.
The big "wins" are:
- Multimodal aspect in trying to keep parity with OpenAI's offerings.
- An order of magnitude faster than OpenAI's 4o image gen.
by simonw on 5/7/25, 10:04 PM
curl -s -X POST \
"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-preview-image-generation:generateContent?key=$GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"parts": [
{"text": "Provide a vegetarian recipe for butter chicken but with chickpeas not chicken and include many inline illustrations along the way"}
]
}],
"generationConfig":{"responseModalities":["TEXT","IMAGE"]}
}' > /tmp/out.json
And got back 41MB of JSON with 28 base64 images in it: https://gist.github.com/simonw/55894032b2c60b35f320b6a166ded...
At 4c per image, that's more than a dollar for that single prompt.
I built a quick tool, https://tools.simonwillison.net/gemini-image-json, that you can paste that JSON into to see the images rendered.
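If you'd rather extract the images to files than paste the JSON into a tool, a minimal Python sketch looks like this. It assumes the v1beta REST response shape (candidates -> content -> parts, with image parts carried as base64 inlineData):

    import base64
    import json

    # Load the response saved by the curl command above.
    with open("/tmp/out.json") as f:
        response = json.load(f)

    # Walk candidates -> content -> parts; image parts carry inlineData.
    count = 0
    for candidate in response.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            inline = part.get("inlineData")
            if inline and inline.get("mimeType", "").startswith("image/"):
                ext = inline["mimeType"].split("/")[-1]
                with open(f"/tmp/image_{count:02d}.{ext}", "wb") as out:
                    out.write(base64.b64decode(inline["data"]))
                count += 1

    print(f"Extracted {count} images")

The inlineData parts sit alongside the text parts in the same list, which is why the recipe's prose and illustrations interleave.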
by eminence32 on 5/7/25, 5:38 PM
I'm sure part of this is a lack of imagination on my part about how to describe the vague image in my own head. But I guess I have a lot of doubts about using a conversational interface for this kind of stuff.
by refulgentis on 5/7/25, 5:48 PM
Now I can use:
- Gemini 2.0 Flash Image Generation Preview (May) instead of Gemini 2.0 Flash Image Generation Preview (March)
- or when I need text, Gemini 2.5 Flash Thinking 04-17 Preview ("natively multimodal" w/o image generation)
- When I need to control thinking budgets, I can do that with Gemini 2.5 Flash Preview 04-17, with not-thinking at a 50% price increase over a month prior
- And when I need realtime, fallback to Gemini 2.0 Flash 001 Live Preview (announced as In Preview on April 9 2025 after the Multimodal Live API was announced as released on December 11 2024)
- I can't control Gemini 2.5 Pro Experimental/Preview/Preview IO Edition's thinking budgets, but good news follows in the next bullet: they'll swap the model out underneath me with one that thinks ~10x less, so at least it's in the same cost ballpark as their competitors
- and we all got autoupgraded from Gemini 2.5 Pro Preview (03/25 released 4/2) to Gemini 2.5 Pro Preview (IO Edition) yesterday! Yay!
by cush on 5/7/25, 5:54 PM
https://aistudio.google.com/apps/bundled/gemini-co-drawing?s...
by minimaxir on 5/7/25, 6:00 PM
The main difference is that Gemini can incorporate a conversation when generating the image, as demoed here, while Imagen 3 is strictly text-in/image-out with optional mask-constrained edits. Imagen 3 likely allows for higher-quality images overall, though, if you're skilled with prompt engineering. This is a nuance that is annoying to differentiate.
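As a rough illustration of that difference, here's a sketch of the conversational flow against the same v1beta REST endpoint as the curl example above (the prompts and the exact handling of the model turn are assumptions, not documented guarantees):

    import os
    import requests

    URL = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        "gemini-2.0-flash-preview-image-generation:generateContent"
        f"?key={os.environ['GEMINI_API_KEY']}"
    )

    def generate(contents):
        body = {
            "contents": contents,
            "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
        }
        return requests.post(URL, json=body).json()

    # Turn 1: generate an initial image.
    contents = [{"role": "user", "parts": [{"text": "A watercolor of a lighthouse"}]}]
    first = generate(contents)

    # Turn 2: send the model's turn back (its text + inline image parts)
    # and ask for an edit in context. Imagen 3 has no equivalent to this.
    contents.append({"role": "model", "parts": first["candidates"][0]["content"]["parts"]})
    contents.append({"role": "user", "parts": [{"text": "Now make it a night scene"}]})
    second = generate(contents)

The image comes back as a part of the model's turn, so it can be referred to and revised like any other conversation content.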
by mkl on 5/8/25, 2:21 AM
The lamp is put on a different desk in a totally different room, with AI mush in the foreground. Props for not cherry-picking a first example, I guess. The sofa colour one is somehow much better, with a less specific instruction.
by GaggiX on 5/7/25, 5:40 PM
Btw, it's still not as good as ChatGPT, but much, much faster. That's nice progress compared to the previous model.
by thornewolf on 5/7/25, 5:30 PM
There are a lot of failure modes still, but what I want is a very large cookbook showing what the known-good workflows are. Since this is so directly downstream of (limited) training data, it might be that I am just prompting in an ever-so-slightly bad way.
by mNovak on 5/7/25, 6:31 PM
Seems to help if you explicitly describe the scene, but then the drawing-along aspect seems relatively pointless.
by Yiling-J on 5/7/25, 11:30 PM
You can see the full table with images here: https://tabulator-ai.notion.site/1df2066c65b580e9ad76dbd12ae...
I think the results came out quite well. Be aware that I don't generate a text prompt from the row data for image generation. Instead, the raw row data (ingredients, instructions...) and table metadata (column names and descriptions) are sent directly to gemini-2.0-flash-exp-image-generation.
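A minimal sketch of that approach (the row and metadata fields here are made up; the point is only the send-the-raw-data-as-the-prompt shape):

    import json
    import os
    import requests

    # Hypothetical row and table metadata, serialized as-is instead of
    # being rewritten into a prose prompt.
    row = {"ingredients": ["chickpeas", "butter", "tomatoes"], "instructions": "Simmer..."}
    columns = {"ingredients": "list of ingredients", "instructions": "cooking steps"}

    payload = {
        "contents": [{"parts": [{"text": json.dumps({"row": row, "columns": columns})}]}],
        "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
    }
    url = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        "gemini-2.0-flash-exp-image-generation:generateContent"
        f"?key={os.environ['GEMINI_API_KEY']}"
    )
    response = requests.post(url, json=payload).json()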
by mvdtnz on 5/7/25, 10:17 PM
> Okay, I understand. You want me to replace ONLY the four windows located underneath the arched openings on the right side of the house with bifold doors, leaving all other features of the house unchanged. Here is the edited image:
Followed by no image. This is a behaviour I have seen many times from Gemini in the past, so it's frustrating that it's still a problem.
I give this a 0/10 for my first use case.
by pentagrama on 5/7/25, 7:22 PM
It seems like the real goal here, for Google and other AI companies, is a world flooded with endless AI-generated variants of objects that don’t even exist yet, crafted to be sold and marketed (probably by AI too) to hyper-targeted audiences. This feels like an incoming wave of "AI slop", mass-produced synthetic content, crashing against the small island of genuine human craftsmanship and real, existing objects.
by Tsarp on 5/8/25, 2:31 AM
If, for example, you use ControlNets with an open model like Flux, you can get very close to the style and composition you need, and the results will be far better. Flux also has a few successors coming up now.
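For reference, a rough sketch of that workflow with the diffusers library (the ControlNet checkpoint name is illustrative; pick canny, depth, etc. to match the kind of composition control you want):

    import torch
    from diffusers import FluxControlNetModel, FluxControlNetPipeline
    from diffusers.utils import load_image

    # Load a Flux ControlNet and attach it to the base Flux pipeline.
    controlnet = FluxControlNetModel.from_pretrained(
        "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16)
    pipe = FluxControlNetPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", controlnet=controlnet,
        torch_dtype=torch.bfloat16).to("cuda")

    # The control image (here, an edge map) pins down the composition
    # while the prompt steers the style.
    control_image = load_image("composition_reference.png")
    image = pipe(
        prompt="a cozy reading nook, warm evening light, film grain",
        control_image=control_image,
        controlnet_conditioning_scale=0.6,  # how strictly to follow the layout
        num_inference_steps=28,
    ).images[0]
    image.save("styled_composition.png")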
by emporas on 5/8/25, 2:15 AM
I take an image with some desired colors or typography from an already existing music album, or from Ideogram's poster section. I pass it to Gemini and give the command:
"describe the texture of the picture, all the element and their position in the picture, left side, center right side, up and down, the color using rgb, the artistic style and the calligraphy or font of the letters"
Then I take the result and pass it through a different LLM, because I don't like Gemini that much; I find it much less coherent than other models. I usually use qwen-qwq-32b. I take the description Gemini outputs and give it to Qwen:
" write a similar description, but this time i want a surreal painting with several imaginative colors. Follow the example of image description, add several new and beautiful shapes of all elements and give all details, every side which brushstrokes it uses, and rgb colors it uses, the color palette of the elements of the page, i want it to be a pastel painting like the example, and don't put bioluminesence. I want it to be old style retro style mystery sci fi. Also i want to have a title of "Song Title" and describe the artistic font it uses and it's position in the painting, it should be designed as a drum n bass album cover "*
Then I take the result and give it back to Gemini with the command: "Create an image with text "Song Title" for an album cover: here is the description of the rest of the album"
If the resulting image is good, then it's time to add the font. I take the new image description and pass it through Qwen again, supposing the image description has the fields Title and Typography:
"rewrite the description and add full description of the letters and font of text, clean or distressed, jagged or fluid letters or any other property they might have, where they are overlayed, and make some new patterns about the letter appearance and how big they are and the material they are made of, rewrite the Title and Typography."
I replace the previous description's Title and Typography sections with the new description and create images with beautiful fonts.
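Condensed into code, that loop looks roughly like this (call_gemini and call_qwen are hypothetical wrappers around whatever access you have to each model; only the chaining is the point):

    def remix_album_cover(reference_image, song_title, call_gemini, call_qwen):
        # Step 1: Gemini describes the reference image in exhaustive detail.
        description = call_gemini(
            image=reference_image,
            text="describe the texture of the picture, all the elements and "
                 "their positions, the colors in RGB, the artistic style, and "
                 "the calligraphy or font of the letters")

        # Step 2: a second LLM rewrites that description into a new concept,
        # keeping the structure but swapping in the desired style and title.
        rewritten = call_qwen(
            "Write a similar description, but as a surreal retro sci-fi pastel "
            f"painting titled '{song_title}', designed as a drum n bass album "
            "cover:\n" + description)

        # Step 3: the rewritten description goes back to Gemini as the prompt.
        return call_gemini(
            text=f'Create an image with text "{song_title}" for an album cover: '
                 + rewritten)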
by adverbly on 5/7/25, 6:05 PM
Is it just me or is the market just absolutely terrible at understanding the implications and speed of progress behind what's happening right now in the walls of big G?
by cthulberg on 5/8/25, 10:08 AM
source: https://ai.google.dev/gemini-api/docs/models#gemini-2.0-flas... and my Google AI Studio