from Hacker News

Qwen VLo: From “Understanding” the World to “Depicting” It

by lnyan on 6/27/25, 2:35 PM with 52 comments

  • by rushingcreek on 6/27/25, 2:51 PM

    It doesn't seem to have open weights, which is unfortunate. One of Qwen's strengths historically has been their open-weights strategy, and it would have been great to have a true open-weights competitor to 4o's autoregressive image gen. There are so many interesting research directions that are only possible if we can get access to the weights.

    If Qwen is concerned about recouping its development costs, I suggest looking at BFL's Flux Kontext Dev release from the other day as a model: let researchers and individuals get the weights for free and let startups pay for a reasonably-priced license for commercial use.

  • by b0a04gl on 6/27/25, 3:54 PM

    the image gets compressed into 256 tokens before the language model sees it. ask it to add a hat and it redraws the whole face, because objects aren't stored as separate things: there's no persistent bear in memory. everything lives inside one fused latent soup, and edits are fresh samples under new constraints. every prompt tweak rebalances the whole embedding, which is why even small changes ripple across the image. to me it reads like single-shot scene synthesis, which is good for different use cases.
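
    A rough sketch of that single-shot behavior (purely illustrative; the token count is taken from the comment above, while the tokenizer, codebook size, and decoder are made-up stand-ins, not Qwen VLo's actual pipeline):

      import numpy as np

      # Toy model: the "image" is a flat grid of discrete tokens. An edit does not
      # patch one object in place; it re-samples the whole grid conditioned on the
      # new prompt, so unrelated regions can drift too.
      N_TOKENS = 256   # the figure mentioned above
      VOCAB = 1024     # hypothetical codebook size

      def generate(prompt: str, source_tokens=None) -> np.ndarray:
          """Stand-in for an autoregressive decoder: every token is drawn fresh
          under the new conditioning; nothing from the source is kept verbatim."""
          seed = sum(map(ord, prompt))   # deterministic toy "conditioning"
          rng = np.random.default_rng(seed)
          base = source_tokens if source_tokens is not None else 0
          return (base + rng.integers(0, VOCAB, size=N_TOKENS)) % VOCAB

      bear = generate("a bear holding a watermelon")
      bear_hat = generate("a bear holding a watermelon, wearing a hat", bear)

      # Only a hat was requested, yet most tokens come out different.
      print(f"fraction of tokens changed: {np.mean(bear != bear_hat):.2f}")
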
  • by hexmiles on 6/27/25, 3:36 PM

    While looking at the examples of editing the bear image, I noticed that the model seemed to change more things than were strictly asked.

    As an example, when asked to change the background, it also completely changed the bear (same shirt, but the fur and face are clearly different); and when it turned the bear into a balloon, it changed the background (removing the pavement) and lost the left seed in the watermelon.

    Is this something that can be fixed with better prompting, or is it a limitation of the model/architecture?

  • by afro88 on 6/27/25, 8:29 PM

    Strangely, the image change examples (edits, style transfer, etc.) have that slight yellow tint that GPT Image 1 (ChatGPT 4o's latest image model) has. Why is that? Flux Kontext doesn't seem to do that.
  • by rickydroll on 6/27/25, 3:22 PM

    To my eyes, all these images hit the uncanny valley. All the colors and the shadows are just off.
  • by djaychela on 6/27/25, 3:41 PM

    How do you stop the automatic read-aloud? Why can't websites just sit there and wait until I ask them to do something? It full-screen auto-played a video on watch and then just started reading?

    Firefox on iOS, FTR.

  • by frotaur on 6/27/25, 3:12 PM

    Does anybody know if there is a technical report for this, or for other models that generate images in a similar way? I'd really like to understand the architecture behind 4o-like image gen.
  • by godelski on 6/27/25, 6:24 PM

    As an ML researcher and a degree-holding physicist, I'm really hesitant to use the word "understanding" (and much less hesitant about "describing") around these models. I don't find the language helpful and think it's mostly harmful, tbh.

    The reason we use math in physics is its specificity. It's the same reason coding is so hard [0,1]. I think people aren't giving themselves enough credit here for how much they (you) understand about things. It is the nuances that really matter. There's so much detail here, and we often forget how important those details are because they're just normal to us. It's like forgetting about the ground you walk upon.

    I think something everyone should read is Asimov's "Relativity of Wrong"[2]. That is what we want to see in these systems if we want to start claiming they understand things. We want to see them do deduction and abduction. To be able to refine concepts and ideas. To be able to discover things that are more than just a combination of things they've ingested. What's really difficult here is that we train these things on all human knowledge, and just reciting that knowledge back doesn't demonstrate intelligence. It's very unlikely that they losslessly compress that knowledge into these model sizes, but without a very deep investigation into that data and probing of this knowledge, it is very hard to tell what they know versus what they have memorized. Really, this is a very poor way to go about trying to make intelligence[3], or at least to make intelligence and end up knowing it is intelligent.
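
    (A back-of-envelope illustration of that size gap, with assumed round numbers rather than figures for any specific model:)

      # Assumed round numbers, only to show the orders of magnitude involved.
      params = 70e9               # a hypothetical 70B-parameter model
      model_bytes = params * 2    # ~140 GB at 2 bytes per weight (bf16)

      train_tokens = 10e12              # a hypothetical ~10T-token training corpus
      corpus_bytes = train_tokens * 4   # very roughly a few bytes of text per token

      print(f"weights: ~{model_bytes/1e9:.0f} GB, corpus: ~{corpus_bytes/1e12:.0f} TB")
      print(f"the corpus is ~{corpus_bytes/model_bytes:.0f}x larger than the weights")
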

    To really "understand" things we need to be able to propose counterfactuals[4]. Every physics statement is a counterfactual statement. Take F=ma as a trivial example. We can modify the mass or the acceleration to our heart's content and still determine the force. We can observe a specific mass moving at a specific acceleration and then ask the counterfactual "what if it was twice as heavy?" (twice the mass). *We can answer that!* In fact, your mental model of the world does this too! Yo may not be describing it with math (maybe you are ;) but you are able to propose counterfactuals and do a pretty good job a lot of the time. Doesn't mean you always need to be right though. But the way our heads work is through these types of systems. You daydream these things, you imagine them while you play, and all sorts of things. This, I can say, with high confidence, is not something modern ML (AI) systems do.

      == Edit ==
    
    A good example of the lack of understanding is the image OP uses. Not only does the right hand have the wrong number of fingers, but look at the keys on the keyboard. It does not take much understanding to recognize that you shouldn't have repeated keys... the configuration is all wonky too, like one of those dreams you can immediately tell is a dream[5]. I'd also be willing to bet that the number of keys doesn't align with the number of markers, and the sizing definitely looks off. The more you look at it the worse it gets, and that's really common with these systems: nice at a quick glance, but DEEP in the uncanny valley at more than a glance, and deeper the more you look.

    [0] https://youtube.com/watch?v=cDA3_5982h8

    [1] Code is math. There's an isomorphism between Turing-complete languages and computable mathematics. You can look more into my namesake, Church, and Turing if you want to get more formal, or wait for the comment that corrects a nuanced mistake here (yes, it exists). Also, note that physics and math are not the same thing, but mathematics is unreasonably effective (yes, this is a reference).

    [2] https://hermiene.net/essays-trans/relativity_of_wrong.html

    [3] This is a very different statement from "making something useful." Without a doubt these systems are useful. Do not conflate the two.

    [4] https://en.wikipedia.org/wiki/Counterfactual_thinking

    [5] Yes, you can read in dreams. I do it frequently. Though on occasion I have lucid dreamed because I read something and noticed that it changed when I looked away and looked back.

  • by veltas on 6/27/25, 4:52 PM

    Rather, I think machine learning has made a lot more progress 'depicting' the world than 'understanding' it.
  • by skybrian on 6/27/25, 3:33 PM

    I tried the obligatory pelican riding a bicycle (as an image, not SVG) and some accordion images. It has a bit of trouble with fingers and with getting the black keys right. It's fairly fast.

    https://chat.qwen.ai/s/0f9d558c-2108-4350-98fb-6ee87065d587?...

  • by aredox on 6/27/25, 3:09 PM

    I don't think these words mean what they think they do...