from Hacker News

Video generation models as world simulators

by linksbro on 2/16/24, 12:38 AM with 168 comments

  • by empath-nirvana on 2/16/24, 1:52 AM

    I think people might be missing what this enables. It can make plausible continuations of video, with realistic physics. What happens if this gets fast enough to work _in real time_?

    Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.

    You can probably already imagine different ways to wire the output to text generation, to controlling its own motions, etc. -- predicting outcomes of actions it itself could plausibly take and choosing the best one.

    It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?

    Edit: I'm going to add a specific case. Imagine a house cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
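
    A minimal sketch of that predict-act-correct loop, assuming a hypothetical video world model and robot interface (nothing here is an actual Sora or robotics API); the point is just the structure of imagining futures, acting toward the best one, and scoring each prediction against what actually happens:

      import numpy as np

      class VideoWorldModel:
          """Hypothetical wrapper around a video model that can extend a clip."""
          def continuations(self, frames, n=4, horizon=16):
              raise NotImplementedError  # return n candidate futures, `horizon` frames each

      class Robot:
          """Hypothetical robot: camera frames in, motor commands out."""
          def observe(self):
              raise NotImplementedError
          def act_toward(self, target_frames):
              raise NotImplementedError

      def prediction_error(predicted, actual):
          # Crude error signal: mean absolute pixel difference.
          return float(np.mean(np.abs(predicted - actual)))

      def control_loop(robot, model, goal_score, steps=1000):
          history = [robot.observe()]
          for _ in range(steps):
              # Imagine several possible futures from the recent camera feed.
              futures = model.continuations(history[-16:])
              # Pick the imagined future that best serves the goal (e.g. "room is clean").
              best = max(futures, key=goal_score)
              # Act to mimic the chosen future, then see what actually happened.
              robot.act_toward(best)
              actual = robot.observe()
              history.append(actual)
              # How well did the chosen prediction model the actual outcome?
              err = prediction_error(best[0], actual)
              # A real system would feed `err` back to correct the model or planner.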

  • by SushiHippie on 2/16/24, 1:43 AM

    I like that this one shows some "fails", and not just the top of the top results:

    For example, the surfer is surfing in the air at the end:

    https://cdn.openai.com/tmp/s/prompting_7.mp4

    Or this "breaking" glass that does not break, but spills liquid in some weird way:

    https://cdn.openai.com/tmp/s/discussion_0.mp4

    Or the way this person walks:

    https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...

    Or wherever this map is coming from:

    https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...

  • by modeless on 2/16/24, 3:03 AM

    > Other interactions, like eating food, do not always yield correct changes in object state

    So this is why they haven't shown Will Smith eating spaghetti.

    > These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world

    This is exciting for robotics. But an even closer application would be filling holes in Gaussian splatting scenes. If you want to make a 3D walkthrough of a space you need to take hundreds to thousands of photos with seamless coverage of every possible angle, and you're still guaranteed to miss some. It seems like a model this capable could easily produce plausible reconstructions of hidden corners, close-up detail, or other things that would just be holes or blurry parts in a standard reconstruction. You might only need five or ten regular photos of a place to get a completely seamless and realistic 3D scene that you could explore from any angle. You could also do things like subtract people or other unwanted objects from the scene. Such an extrapolated reconstruction might not be completely faithful to reality in every detail, but I think this could enable lots of applications regardless.
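
    One way that hole-filling idea could be wired up, as a rough sketch. Every helper below (train_splats, find_undersampled_views, render, refine_splats) and the video_model.outpaint call are hypothetical placeholders for existing splatting tooling plus a future video-model API, not anything available today:

      from typing import Any, List

      # Hypothetical placeholders for a Gaussian-splatting trainer and renderer.
      def train_splats(photos: List[Any]) -> Any: ...
      def find_undersampled_views(splats: Any) -> List[Any]: ...
      def render(splats: Any, around: Any) -> Any: ...
      def refine_splats(splats: Any, frames: Any, weight: float) -> Any: ...

      def densify_scene(photos: List[Any], video_model: Any) -> Any:
          """Use a generative video model to patch coverage gaps in a splat scene."""
          splats = train_splats(photos)                 # rough scene from 5-10 photos
          for pose in find_undersampled_views(splats):  # camera poses with holes or blur
              context = render(splats, around=pose)     # nearby renders as conditioning
              frames = video_model.outpaint(context, camera_path=pose)  # hypothetical call
              # Treat generated frames as extra, lower-weight observations.
              splats = refine_splats(splats, frames, weight=0.3)
          return splats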

  • by nopinsight on 2/16/24, 1:37 AM

    AlphaGo and AlphaZero were able to achieve superhuman performance due to the availability of perfect simulators for the game of Go. There is no such simulator for the real world we live in (although pure LLMs sort of learn a rough, abstract representation of the world as perceived by humans). Sora is an attempt to build such a simulator using deep learning.

      “Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
    
    General, superhuman robotic capabilities on the software side can be achieved once such a simulator is good enough. (Whether that can be achieved with this approach is still not certain.)

    Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.

  • by guybedo on 2/16/24, 2:09 AM

    I think it's Yann LeCun who has stated a few times that video is a better way to train large models, as it's more information dense.

    The results really are impressive. Being able to generate such high quality videos, and to extend videos into the past and the future, shows how much the model "understands" the real world, object interactions, 3D composition, etc.

    Although image generation already requires the model to know a lot about the world, I think there's really a huge gap with video generation, where the model needs to "know" 3D, object movements and interactions.

  • by iliane5 on 2/16/24, 2:50 AM

    Watching an entirely generated video of someone painting is crazy.

    I can't wait to play with this but I can't even imagine how expensive it must be. They're training in full resolution and can generate up to a minute of video.

    Seeing how bad video generation was, I expected it would take a few more years to get to this, but it seems like this is another case of "Add data & compute"(TM), where transformers prove once again that they'll learn everything and be great at it.

  • by data-ottawa on 2/16/24, 1:36 AM

    I know the main post has been getting a lot of reaction, but this page absolutely blew me away. The results are striking.

    The robot examples are very underwhelming, but the people and background people are all very well done, and at a level much better than most static image diffusion models produce. Generating the same people as they interact with objects is also not something I expected a model like this to do well so soon.

  • by lairv on 2/16/24, 1:09 AM

    I find it wild that this model has no explicit 3D prior, yet learns to generate videos with such 3D consistency that you can directly train a 3D representation (NeRF-like) from those videos: https://twitter.com/BenMildenhall/status/1758224827788468722
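
    For anyone wanting to reproduce that, a minimal sketch of the pipeline (generated clip in, radiance field out), assuming the nerfstudio toolchain; the exact commands and flags are approximate and worth checking against its docs:

      import subprocess

      VIDEO = "sora_clip.mp4"  # a generated clip with a moving camera

      # 1. Extract frames and estimate camera poses (nerfstudio wraps COLMAP here).
      subprocess.run(["ns-process-data", "video", "--data", VIDEO,
                      "--output-dir", "scene/"], check=True)

      # 2. Fit a NeRF-like representation to the posed frames.
      subprocess.run(["ns-train", "nerfacto", "--data", "scene/"], check=True)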

  • by pedrovhb on 2/16/24, 1:56 AM

    That's an interesting idea. Analogous to how LLMs are simply "text predictors" but end up having to learn a model of language and the world to correctly predict coherent text, it makes sense that "video predictors" also have to learn a model of the world that makes sense. I wonder how many orders of magnitude further they have to evolve to be similarly useful.

  • by anonyfox on 2/16/24, 9:29 AM

    If they allowed this (maybe a premium+ model), they could soon make much of the porn industry obsolete -- not the websites, but the (often abused) sex work behind them. Everyone could describe whatever fetish they're into and get it visualized instantly, without any physical human suffering needed to produce these videos.

    I know it's a delicate topic people (especially in the US) don't want to speak about at all, but damn, this is a giant market and could do humanity good if done well.

  • by zone411 on 2/16/24, 6:59 AM

    Video will be especially important for language models to grasp physical actions that are instinctive and obvious to humans but not explicitly detailed in text or video captions. I mentioned this in 2022:

    https://twitter.com/LechMazur/status/1607929403421462528

    https://twitter.com/LechMazur/status/1619032477951213568

  • by dang on 2/16/24, 3:31 AM

    Related ongoing thread:

    Sora: Creating video from text - https://news.ycombinator.com/item?id=39386156 - Feb 2024 (1430 comments)

  • by GaggiX on 2/16/24, 2:16 AM

    The Minecraft demo makes me think that we will soon be playing games directly from the output of one of these models -- unlimited content.

  • by koonsolo on 2/16/24, 11:34 AM

    Yesterday I was watching a movie on Netflix and thought to myself, what if Netflix generated a movie based on what I want to see and what I like.

    Plus, it could generate it in real time and take my responses into account. I look bored? Spice it up, etc.

    Today such a thing seems closer than I thought.

  • by binary132 on 2/16/24, 3:12 AM

    Maybe this says more about me than about the technology, but I found the consistency of the Minecraft simulation super impressive.

  • by chankstein38 on 2/16/24, 2:57 PM

    This is the second Sora announcement I've seen. Am I missing how I can play with it? The examples in the papers are all well and good but I want to get my hands on it and try it.

  • by proc0 on 2/16/24, 4:30 AM

    I don't know if there is research into this (I didn't see it mentioned here), but this is the most probable path to something like AI consciousness and AGI. Of course it's highly speculative, but video-to-world simulation is how the brain evolved and probably what is needed to have a robot behave like a living being. It would just do this in reverse: video input to inner world model, then use that for reasoning about the world. Extremely fascinating, and also scary that this is happening so quickly.

  • by myth_drannon on 2/16/24, 2:03 AM

    Should I short all the 3D tools/movies/VFX companies?

  • by colesantiago on 2/16/24, 1:36 AM

    Damn, even Minecraft videos are being simulated; this is crazy to see from OpenAI.

    Edit, changed the links to the direct ones!

    https://cdn.openai.com/tmp/s/simulation_6.mp4

    https://cdn.openai.com/tmp/s/simulation_7.mp4

  • by pmontra on 2/16/24, 9:06 AM

    The video with the two MTBs going downhill: it seems to me that the long left turn that begins a few seconds into the video is way too long. It's easy to misjudge that kind of thing (try to draw a road race track by looking at a single lap of it), but the trail could end up below the point where it started, or too close to it to be physically realistic. I was expecting to see a right turn at any moment, but it kept going left. It could be another consequence of the lack of real knowledge about the world, similar to the glass-shattering example at the end of the article.

  • by htrp on 2/16/24, 4:13 PM

    > We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.

    Every CV preprocessing pipeline is in shambles now.
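
    Roughly, the practice being upended: most generative pipelines center-crop everything to a fixed square, while the report's alternative keeps native aspect ratios, e.g. by bucketing similarly shaped clips into the same batch. A small illustrative sketch (the bucketing scheme is a common community trick, not necessarily what OpenAI does):

      from collections import defaultdict

      def center_crop_square(frame_w, frame_h):
          """The old default: crop every frame to a centered square."""
          side = min(frame_w, frame_h)
          x0 = (frame_w - side) // 2
          y0 = (frame_h - side) // 2
          return x0, y0, side, side  # crop box that can cut the subject out of frame

      def bucket_by_aspect(videos, buckets=(9 / 16, 3 / 4, 1.0, 4 / 3, 16 / 9)):
          """Native-aspect alternative: group clips by nearest aspect ratio so each
          batch shares a shape without cropping away composition."""
          groups = defaultdict(list)
          for w, h in videos:
              nearest = min(buckets, key=lambda b: abs(b - w / h))
              groups[nearest].append((w, h))
          return groups

      # A 1920x1080 clip keeps its 16:9 framing instead of losing the left and
      # right thirds to a 1080x1080 center crop.
      print(center_crop_square(1920, 1080))   # (420, 0, 1080, 1080)
      print(bucket_by_aspect([(1920, 1080), (1080, 1920), (1024, 1024)]))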

  • by vunderba on 2/16/24, 1:41 AM

    The improvement in temporal consistency is truly remarkable, given that these generated videos are 3 to 4 times longer than anything else on the market (Runway, Pika, etc.).

  • by sjwhevvvvvsj on 2/16/24, 1:40 AM

    This is insanely good, but look at the legs around 16 seconds in: they kinda morph through each other. Generally, the legs are slightly unnerving.

    Still, god damn.

  • by danavar on 2/16/24, 2:31 AM

    The Sora videos are impressive, but are these really world simulators? While some notion of real-world physics probably exists somewhere within the model, doesn't all the completely artificial training data corrupt it?

    Reasoning, logic, formal systems, and physics exist in a seemingly completely different, mathematical space than pure video.

    This is just a contrived, interesting viewpoint of the technology, right?

  • by newswasboring on 2/16/24, 4:41 PM

    This is a totally silly thought, but I still want to get it out there.

    > Other interactions, like eating food, do not always yield correct changes in object state

    Can this be because we just don't shoot a lot of people eating? I think it is general advice to not show people eating on camera for various reasons. I wonder if we know if that kind of topic bias exists in the dataset.

  • by anirudhv27 on 2/16/24, 3:37 PM

    What makes OpenAI so far ahead of all of these other research firms (or even startups like Pika, Runway, etc.)? I feel like I see so many fields where progress is being made across the board, and then OpenAI suddenly swoops in with an insane breakthrough light-years ahead of everyone else.

  • by pellucide on 2/16/24, 3:44 AM

    I am a newbie to this area. Honest questions:

    Is this generating videos as streaming content, e.g. like an mp4 video? As far as I can see, it is. Is it possible for AI to actually produce the 3D models?

    What kind of compute resources would be required to produce the 3D models?

  • by jk_tech on 2/16/24, 2:27 AM

    This is some incredible and fascinating work! The applications seem endless.

    1. High quality video or image from text
    2. Taking in any content as input and generating forwards/backwards in time
    3. Style transformation
    4. Digital world simulation!

  • by exe34 on 2/17/24, 2:20 PM

    The current development of AI seems like a speed run of Crystal Society in terms of its interaction with the world. The only thing missing is the Inner Purpose.

  • by neurostimulant on 2/16/24, 7:15 AM

    Where does the training data come from? YouTube?

  • by lbrito on 2/16/24, 2:28 AM

    Okay, The Matrix can't be too far away now.

  • by blueprint on 2/16/24, 3:41 AM

    > Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

    So they're gonna include the never-before-observed-but-predicted Unruh effect as well? And other quantum theory? Cool...

    > For example, it does not accurately model the physics of many basic interactions, like glass shattering.

    ... oh

    Isn't all of the training predicated on visible, gathered data rather than theory? If so, I don't think it's right to call these things simulators of the physical world if they don't include physical theory. DFT at least has some roots in theory.

  • by liuliu on 2/16/24, 1:45 AM

    Wow, this is really just a scaled-up DiT. We are going to see tons of similar models very soon.
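
    For context, DiT is the Diffusion Transformer of Peebles & Xie: a plain transformer over patchified latent tokens, with the diffusion timestep and conditioning injected through adaptive layer norm. A minimal illustrative block in the spirit of the paper's adaLN-Zero variant (Sora's actual spacetime-patch architecture isn't public, so treat this as a sketch):

      import torch
      import torch.nn as nn

      class DiTBlock(nn.Module):
          """One Diffusion-Transformer-style block with adaptive layer norm conditioning."""
          def __init__(self, dim: int, heads: int):
              super().__init__()
              self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
              self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
              self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
              self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
              # Conditioning (timestep + caption embedding) -> per-block scale/shift/gate.
              self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

          def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
              # x: (batch, tokens, dim) spacetime-patch tokens; cond: (batch, dim)
              s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
              h = self.norm1(x) * (1 + s1) + b1
              x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
              h = self.norm2(x) * (1 + s2) + b2
              return x + g2 * self.mlp(h)

      # Toy usage: 2 clips, 8 spacetime-patch tokens each, width 256.
      block = DiTBlock(dim=256, heads=8)
      out = block(torch.randn(2, 8, 256), torch.randn(2, 256))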

  • by yakito on 2/16/24, 2:32 PM

    Does anyone know why most of the videos are in slow motion?

  • by tokai on 2/16/24, 11:45 AM

    Ugh, AI-generated images everywhere are already annoying enough. Now we're gonna have these factitious videos clogging up everything, and I'll have to explain to my old neighbor again and again that Biden did in fact not eat a fetus.

  • by advael on 2/16/24, 8:37 AM

    People are obviously already pointing out the errors in various physical interactions shown in the demo videos, including the research team themselves, and I think the plausibility of the generated videos will likely improve as they work on the model more. However, I think the major reason this generation -> simulation leap might be harder than they think is actually a plausibility/accuracy distinction. Generative models are general and versatile compared to predictive models, but they're intrinsically learning an objective that assesses their extrapolations on spatial or sequential (or in the case of video, both) plausibility, which has a lot more degrees of freedom than accuracy. In other words, the ability to create reasonable-enough hypotheses for the next frame or the next pixel over could end up not being enough.

    The optimistic scenario is that it's possible to get to a simulation by narrowing this hypothesis space enough to accurately model reality. In other words, it's possible that this just falls out of the plausibility being continuously improved -- the subset of plausible hypotheses shrinks as the model gets better, and eventually we get a reality-predictor -- but I think there are good reasons to think that's far from guaranteed. I'd be curious to see what happens if you restrict training data to unaltered camera footage rather than allowing anything fictitious, but the least optimistic possibility is that this kind of capability is necessary but not sufficient for adequate prediction (or, slightly more optimistically, can only do so with amounts of resolution that are currently infeasible, or something).

    One of the reasons the less optimistic scenarios seem likely is that the kinds of extrapolation errors this model makes are of similar character to those of LLMs: extrapolation follows a gradient of smooth apparent transitions rather than some underlying logic about the objects portrayed, and it sometimes seems to just sort of ignore situations that are far enough outside of what it's seen rather than reconcile them. For example, the tidal wave/historical hall example is a scenario unlikely to have been in the training data. Sure, there's the funny bit at the end where the surfer appears to levitate in the air, but there's a much larger issue with how these two contrasting scenes interact, or rather fail to. What we see looks a lot more like a scene of surfing superimposed, photoshop-style, on a still image of the hall, as there's no evidence of the water interacting with the seats or walls in the hall at all. The model will just roll with whatever you tell it to do as best it can, but it's not doing something like modeling "what would happen if" that implausible scenario played out, and even doing that poorly would be a better sign that it is "simulating" the described scenario. Instead, we have impressive results for prompts that likely strongly correspond to scenes the model has seen, and evidence of a lack of composition in cases where a particular composition is unlikely to have been seen and needs some underlying understanding of how it "would" work that is visible to us.

  • by bawana on 2/16/24, 1:57 PM

    SORA... the entire movie industry is now out of a job.

  • by RayVR on 2/16/24, 1:38 AM

    If there's one thing I've always wanted, it's shitty video knockoffs of real life. Can't wait to stream some AI hallucinations.