by meetpateltech on 5/7/25, 4:06 PM with 100 comments
by vunderba on 5/7/25, 7:21 PM
https://genai-showdown.specr.net
I don't know how much of Google's original Imagen 3.0 is incorporated into this new model, but the overall aesthetic quality unfortunately seems significantly worse.
The big "wins" are:
- Multimodal aspect in trying to keep parity with OpenAI's offerings.
- An order of magnitude faster than OpenAI's 4o image gen.
by simonw on 5/7/25, 10:04 PM
curl -s -X POST \
"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-preview-image-generation:generateContent?key=$GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"parts": [
{"text": "Provide a vegetarian recipe for butter chicken but with chickpeas not chicken and include many inline illustrations along the way"}
]
}],
"generationConfig":{"responseModalities":["TEXT","IMAGE"]}
}' > /tmp/out.json
And got back 41MB of JSON with 28 base64 images in it: https://gist.github.com/simonw/55894032b2c60b35f320b6a166ded...
At 4c per image, that's more than a dollar for that single prompt.
I built a quick tool, https://tools.simonwillison.net/gemini-image-json, that you can paste that JSON into to see the images rendered.
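If you'd rather extract the images to files than paste the JSON into a tool, a minimal Python sketch looks like this. It assumes the v1beta REST response shape (candidates -> content -> parts, with image parts carried as base64 inlineData):

    import base64
    import json

    # Load the response saved by the curl command above.
    with open("/tmp/out.json") as f:
        response = json.load(f)

    # Walk candidates -> content -> parts; image parts carry inlineData.
    count = 0
    for candidate in response.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            inline = part.get("inlineData")
            if inline and inline.get("mimeType", "").startswith("image/"):
                ext = inline["mimeType"].split("/")[-1]
                with open(f"/tmp/image_{count:02d}.{ext}", "wb") as out:
                    out.write(base64.b64decode(inline["data"]))
                count += 1

    print(f"Extracted {count} images")

The inlineData parts sit alongside the text parts in the same list, which is why the recipe's prose and illustrations interleave.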
by eminence32 on 5/7/25, 5:38 PM
I'm sure part of this is a lack of imagination on my part about how to describe the vague image in my own head. But I guess I have a lot of doubts about using a conversational interface for this kind of stuff.
by refulgentis on 5/7/25, 5:48 PM
Now I can use:
- Gemini 2.0 Flash Image Generation Preview (May) instead of Gemini 2.0 Flash Image Generation Preview (March)
- or when I need text, Gemini 2.5 Flash Thinking 04-17 Preview ("natively multimodal" w/o image generation)
- When I need to control thinking budgets, I can do that with Gemini 2.5 Flash Preview 04-17, with not-thinking at a 50% price increase over a month prior
- And when I need realtime, fallback to Gemini 2.0 Flash 001 Live Preview (announced as In Preview on April 9 2025 after the Multimodal Live API was announced as released on December 11 2024)
- I can't control Gemini 2.5 Pro Experimental/Preview/Preview IO Edition's thinking budgets, but good news follows in the next bullet: they'll swap the model out underneath me with one that thinks ~10x less, so at least it's in the same cost ballpark as their competitors
- and we all got autoupgraded from Gemini 2.5 Pro Preview (03/25 released 4/2) to Gemini 2.5 Pro Preview (IO Edition) yesterday! Yay!
by cush on 5/7/25, 5:54 PM
https://aistudio.google.com/apps/bundled/gemini-co-drawing?s...
by minimaxir on 5/7/25, 6:00 PM
The main difference is that Gemini can incorporate a conversation when generating the image, as demoed here, while Imagen 3 is strictly text-in/image-out with optional mask-constrained edits. Imagen 3 likely allows for higher-quality images overall, though, if you're skilled with prompt engineering. This is a nuance that is annoying to differentiate.
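As a rough illustration of that difference, here's a sketch of the conversational flow against the same v1beta REST endpoint as the curl example above (the prompts and the exact handling of the model turn are assumptions, not documented guarantees):

    import os
    import requests

    URL = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        "gemini-2.0-flash-preview-image-generation:generateContent"
        f"?key={os.environ['GEMINI_API_KEY']}"
    )

    def generate(contents):
        body = {
            "contents": contents,
            "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
        }
        return requests.post(URL, json=body).json()

    # Turn 1: generate an initial image.
    contents = [{"role": "user", "parts": [{"text": "A watercolor of a lighthouse"}]}]
    first = generate(contents)

    # Turn 2: send the model's turn back (its text + inline image parts)
    # and ask for an edit in context. Imagen 3 has no equivalent to this.
    contents.append({"role": "model", "parts": first["candidates"][0]["content"]["parts"]})
    contents.append({"role": "user", "parts": [{"text": "Now make it a night scene"}]})
    second = generate(contents)

The image comes back as a part of the model's turn, so it can be referred to and revised like any other conversation content.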
by mkl on 5/8/25, 2:21 AM
The lamp is put on a different desk in a totally different room, with AI mush in the foreground. Props for not cherry-picking a first example, I guess. The sofa colour one is somehow much better, with a less specific instruction.
by GaggiX on 5/7/25, 5:40 PM
Btw, it's still not as good as ChatGPT, but much, much faster. That's nice progress compared to the previous model.
by thornewolf on 5/7/25, 5:30 PM
There are a lot of failure modes still, but what I want is a very large cookbook showing what the known-good workflows are. Since this is so directly downstream of (limited) training data, it might be that I am just prompting in an ever-so-slightly bad way.
by mNovak on 5/7/25, 6:31 PM
Seems to help if you explicitly describe the scene, but then the drawing-along aspect seems relatively pointless.
by Yiling-J on 5/7/25, 11:30 PM
You can see the full table with images here: https://tabulator-ai.notion.site/1df2066c65b580e9ad76dbd12ae...
I think the results came out quite well. Be aware that I don't generate a text prompt from the row data for image generation. Instead, the raw row data (ingredients, instructions...) and table metadata (column names and descriptions) are sent directly to gemini-2.0-flash-exp-image-generation.
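A minimal sketch of that approach (the row and metadata fields here are made up; the point is only the send-the-raw-data-as-the-prompt shape):

    import json
    import os
    import requests

    # Hypothetical row and table metadata, serialized as-is instead of
    # being rewritten into a prose prompt.
    row = {"ingredients": ["chickpeas", "butter", "tomatoes"], "instructions": "Simmer..."}
    columns = {"ingredients": "list of ingredients", "instructions": "cooking steps"}

    payload = {
        "contents": [{"parts": [{"text": json.dumps({"row": row, "columns": columns})}]}],
        "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
    }
    url = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        "gemini-2.0-flash-exp-image-generation:generateContent"
        f"?key={os.environ['GEMINI_API_KEY']}"
    )
    response = requests.post(url, json=payload).json()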
by mvdtnz on 5/7/25, 10:17 PM
> Okay, I understand. You want me to replace ONLY the four windows located underneath the arched openings on the right side of the house with bifold doors, leaving all other features of the house unchanged. Here is the edited image:
Followed by no image. This is a behaviour I have seen many times from Gemini in the past, so it's frustrating that it's still a problem.
I give this a 0/10 for my first use case.
by pentagrama on 5/7/25, 7:22 PM
It seems like the real goal here, for Google and other AI companies, is a world flooded with endless AI-generated variants of objects that don’t even exist yet, crafted to be sold and marketed (probably by AI too) to hyper-targeted audiences. This feels like an incoming wave of "AI slop", mass-produced synthetic content, crashing against the small island of genuine human craftsmanship and real, existing objects.
by Tsarp on 5/8/25, 2:31 AM
If, for example, you use ControlNets with an open model like Flux, you can get very close to the style and composition you need, and the results will be far better. Flux also has a few successors coming up now.
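For reference, a rough sketch of that workflow with the diffusers library (the ControlNet checkpoint name is illustrative; pick canny, depth, etc. to match the kind of composition control you want):

    import torch
    from diffusers import FluxControlNetModel, FluxControlNetPipeline
    from diffusers.utils import load_image

    # Load a Flux ControlNet and attach it to the base Flux pipeline.
    controlnet = FluxControlNetModel.from_pretrained(
        "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16)
    pipe = FluxControlNetPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", controlnet=controlnet,
        torch_dtype=torch.bfloat16).to("cuda")

    # The control image (here, an edge map) pins down the composition
    # while the prompt steers the style.
    control_image = load_image("composition_reference.png")
    image = pipe(
        prompt="a cozy reading nook, warm evening light, film grain",
        control_image=control_image,
        controlnet_conditioning_scale=0.6,  # how strictly to follow the layout
        num_inference_steps=28,
    ).images[0]
    image.save("styled_composition.png")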
by emporas on 5/8/25, 2:15 AM
I take an image with some desired colors or typography from an already existing music album, or from Ideogram's poster section. I pass it to Gemini and give the command:
"describe the texture of the picture, all the element and their position in the picture, left side, center right side, up and down, the color using rgb, the artistic style and the calligraphy or font of the letters"
Then I take the result and pass it through a different LLM, because I don't like Gemini that much; I find it much less coherent than other models. I usually use qwen-qwq-32b. I take the description Gemini outputs and give it to Qwen:
" write a similar description, but this time i want a surreal painting with several imaginative colors. Follow the example of image description, add several new and beautiful shapes of all elements and give all details, every side which brushstrokes it uses, and rgb colors it uses, the color palette of the elements of the page, i want it to be a pastel painting like the example, and don't put bioluminesence. I want it to be old style retro style mystery sci fi. Also i want to have a title of "Song Title" and describe the artistic font it uses and it's position in the painting, it should be designed as a drum n bass album cover "*
Then I take the result and give it back to Gemini with the command: "Create an image with text "Song Title" for an album cover: here is the description of the rest of the album"
If the resulting image is good, then it's time to add the font. I take the new image description and pass it through Qwen again, supposing the image description has the fields Title and Typography:
"rewrite the description and add full description of the letters and font of text, clean or distressed, jagged or fluid letters or any other property they might have, where they are overlayed, and make some new patterns about the letter appearance and how big they are and the material they are made of, rewrite the Title and Typography."
I replace the previous description's Title and Typography sections with the new description and create images with beautiful fonts.
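Condensed into code, that loop looks roughly like this (call_gemini and call_qwen are hypothetical wrappers around whatever access you have to each model; only the chaining is the point):

    def remix_album_cover(reference_image, song_title, call_gemini, call_qwen):
        # Step 1: Gemini describes the reference image in exhaustive detail.
        description = call_gemini(
            image=reference_image,
            text="describe the texture of the picture, all the elements and "
                 "their positions, the colors in RGB, the artistic style, and "
                 "the calligraphy or font of the letters")

        # Step 2: a second LLM rewrites that description into a new concept,
        # keeping the structure but swapping in the desired style and title.
        rewritten = call_qwen(
            "Write a similar description, but as a surreal retro sci-fi pastel "
            f"painting titled '{song_title}', designed as a drum n bass album "
            "cover:\n" + description)

        # Step 3: the rewritten description goes back to Gemini as the prompt.
        return call_gemini(
            text=f'Create an image with text "{song_title}" for an album cover: '
                 + rewritten)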
by adverbly on 5/7/25, 6:05 PM
Is it just me or is the market just absolutely terrible at understanding the implications and speed of progress behind what's happening right now in the walls of big G?
by cthulberg on 5/8/25, 10:08 AM
source: https://ai.google.dev/gemini-api/docs/models#gemini-2.0-flas... and my Google AI Studio