by linksbro on 2/16/24, 12:38 AM with 168 comments
by empath-nirvana on 2/16/24, 1:52 AM
Connect this to a robot that has a real-time camera feed. Have it constantly generate potential future continuations of the feed it's getting -- maybe more than one. You have an autonomous robot building a real-time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome, and I think you're _really_ close to AGI.
You can probably already imagine different ways to wire the output to text generation, to controlling its own motions, and so on: predicting outcomes based on actions it itself could plausibly take and choosing the best one.
It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?
Edit: I'm going to add a specific case. Imagine a house-cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
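A rough sketch of that predict/act/error-correct loop, in Python, with hypothetical names throughout (none of these objects or methods are a real API):

    # Hypothetical sketch of the loop described above; `world_model`, `camera`,
    # and `robot` are placeholders, not real libraries.
    def control_loop(world_model, camera, robot, n_candidates=4):
        while True:
            frame = camera.read()                    # current view of the room
            goal = world_model.imagine_goal(frame)   # e.g. the same room, but cleaned

            # Imagine several candidate futures (short video continuations),
            # each tied to an action sequence the robot could plausibly take.
            candidates = [world_model.rollout(frame, goal) for _ in range(n_candidates)]

            # Pick the imagined future that best advances toward the goal image.
            best = max(candidates, key=lambda c: c.predicted_progress)

            # Execute only the first step of the plan, then observe reality.
            robot.execute(best.actions[:1])
            observed = camera.read()

            # Error-correct: compare what was imagined to what actually happened,
            # and use the discrepancy to refine the next round of predictions.
            world_model.update(predicted=best.frames[0], actual=observed)

Run several times a second, this is essentially model-predictive control with a learned video model standing in for the simulator.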
by SushiHippie on 2/16/24, 1:43 AM
For example, the surfer is surfing in the air at the end:
https://cdn.openai.com/tmp/s/prompting_7.mp4
Or this "breaking" glass that does not break, but spills liquid in some weird way:
https://cdn.openai.com/tmp/s/discussion_0.mp4
Or the way this person walks:
https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...
Or wherever this map is coming from:
https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...
by modeless on 2/16/24, 3:03 AM
So this is why they haven't shown Will Smith eating spaghetti.
> These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world
This is exciting for robotics. But an even nearer-term application would be filling holes in Gaussian splatting scenes. If you want to make a 3D walkthrough of a space, you need to take hundreds to thousands of photos with seamless coverage of every possible angle, and you're still guaranteed to miss some. A model this capable could plausibly reconstruct hidden corners, close-up detail, and other things that would just be holes or blurry parts in a standard reconstruction. You might only need five or ten regular photos of a place to get a completely seamless and realistic 3D scene that you could explore from any angle. You could also do things like subtract people or other unwanted objects from the scene. Such an extrapolated reconstruction might not be completely faithful to reality in every detail, but I think this could enable lots of applications regardless.
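A rough sketch of what that pipeline might look like, assuming a hypothetical video model and splatting trainer (none of these function names are real APIs):

    # Hypothetical sketch: fill gaps in a sparse Gaussian splatting reconstruction
    # with plausible views synthesized by a generative video model.
    # `photos` is a list of (camera_pose, image) pairs from the real capture.
    def densify_scene(photos, video_model, splat_trainer):
        scene = splat_trainer.fit(photos)            # initial sparse reconstruction

        for pose in scene.sample_unseen_poses():     # viewpoints the capture missed
            render = scene.render(pose)
            if render.coverage < 0.9:                # mostly holes or blur
                # Ask the video model for a plausible completion of this view,
                # conditioned on the nearby real photos for consistency.
                plausible = video_model.inpaint(render, context=photos)
                photos.append((pose, plausible))

        return splat_trainer.fit(photos)             # refit with synthesized views mixed in

The synthesized views wouldn't be ground truth, which is exactly the caveat above: the reconstruction would be plausible rather than faithful in the regions the camera never saw.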
by nopinsight on 2/16/24, 1:37 AM
“Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
General, superhuman robotic capabilities on the software side can be achieved once such a simulator is good enough. (Whether that can be achieved with this approach is still not certain.)

Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages, such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.
by guybedo on 2/16/24, 2:09 AM
The results really are impressive. Being able to generate such high-quality videos, and to extend videos into the past and the future, shows how much the model "understands" the real world: object interactions, 3D composition, etc.
Although image generation already requires the model to know a lot about the world, I think there's really a huge gap with video generation, where the model needs to "know" 3D, object movements, and interactions.
by iliane5 on 2/16/24, 2:50 AM
I can't wait to play with this, but I can't even imagine how expensive it must be. They're training at full resolution and can generate up to a minute of video.
Seeing how bad video generation was, I expected it would take a few more years to get to this, but it seems this is another case of "Add data & compute"(TM), where transformers prove once again that they'll learn everything and be great at it.
by data-ottawa on 2/16/24, 1:36 AM
The robot examples are very underwhelming, but the people, including those in the background, are very well done, at a level much better than most static image diffusion models produce. Generating the same people as they interact with objects is also not something I expected a model like this to do well so soon.
by lairv on 2/16/24, 1:09 AM
by pedrovhb on 2/16/24, 1:56 AM
by anonyfox on 2/16/24, 9:29 AM
I know it's a delicate topic that people (especially in the US) don't want to talk about at all, but damn, this is a giant market and could do humanity good if done well.
by zone411 on 2/16/24, 6:59 AM
by dang on 2/16/24, 3:31 AM
Sora: Creating video from text - https://news.ycombinator.com/item?id=39386156 - Feb 2024 (1430 comments)
by GaggiX on 2/16/24, 2:16 AM
by koonsolo on 2/16/24, 11:34 AM
Plus, it could generate it in real time and take my responses into account. I look bored? Spice it up, etc.
Today such a thing seems closer than I thought.
by binary132 on 2/16/24, 3:12 AM
by chankstein38 on 2/16/24, 2:57 PM
by proc0 on 2/16/24, 4:30 AM
by myth_drannon on 2/16/24, 2:03 AM
by colesantiago on 2/16/24, 1:36 AM
Edit, changed the links to the direct ones!
by pmontra on 2/16/24, 9:06 AM
by htrp on 2/16/24, 4:13 PM
Every CV preprocessing pipeline is in shambles now.
by vunderba on 2/16/24, 1:41 AM
by sjwhevvvvvsj on 2/16/24, 1:40 AM
Still, god damn.
by danavar on 2/16/24, 2:31 AM
Reasoning, logic, formal systems, and physics seem to exist in a completely different mathematical space from pure video.
This is just a contrived, interesting viewpoint of the technology, right?
by newswasboring on 2/16/24, 4:41 PM
> Other interactions, like eating food, do not always yield correct changes in object state
Could this be because we just don't film a lot of people eating? It's generally advised not to show people eating on camera, for various reasons. I wonder whether that kind of topic bias exists in the dataset.
by anirudhv27 on 2/16/24, 3:37 PM
by pellucide on 2/16/24, 3:44 AM
Is this generating videos as streaming content, e.g. like an mp4 video? As far as I can see, it is. Is it possible for the AI to actually produce 3D models?
What kind of compute resources would be required to produce the 3D models?
by jk_tech on 2/16/24, 2:27 AM
1. High-quality video or images from text
2. Taking any content as input and generating forwards/backwards in time
3. Style transformation
4. Digital world simulation!
by exe34 on 2/17/24, 2:20 PM
by neurostimulant on 2/16/24, 7:15 AM
by lbrito on 2/16/24, 2:28 AM
by blueprint on 2/16/24, 3:41 AM
So they're gonna include the never-before-observed-but-predicted Unruh effect as well? And other quantum theory? Cool..
> For example, it does not accurately model the physics of many basic interactions, like glass shattering.
... oh
Isn't all of the training predicated on visible, gathered data rather than theory? If so, I don't think it's right to call these things simulators of the physical world if they don't include physical theory. DFT at least has some roots in theory.
by liuliu on 2/16/24, 1:45 AM
by yakito on 2/16/24, 2:32 PM
by tokai on 2/16/24, 11:45 AM
by advael on 2/16/24, 8:37 AM
One reason the less optimistic scenarios seem likely is that the kinds of extrapolation errors this model makes are similar in character to those of LLMs: extrapolation follows a gradient of smooth apparent transitions rather than any underlying logic about the objects portrayed, and the model sometimes seems to simply ignore situations far enough outside what it has seen rather than reconcile them.

For example, the tidal wave/historical hall example is a scenario unlikely to have been in the training data. Sure, there's the funny bit at the end where the surfer appears to levitate in the air, but there's a much larger issue with how these two contrasting scenes interact, or rather fail to. What we see looks more like surfing footage photoshopped onto a still image of the hall: there's no evidence of the water interacting with the seats or walls of the hall at all. The model will roll with whatever you tell it to do as best it can, but it isn't modeling "what would happen if" that implausible scenario played out, and even doing that poorly would be a better sign that it is "simulating" the described scenario.

Instead, we have impressive results for prompts that likely correspond closely to scenes the model has seen, and evidence of a lack of composition in cases where a particular combination is unlikely to have been seen and needs some underlying understanding, visible to us, of how it "would" work.
by bawana on 2/16/24, 1:57 PM
by RayVR on 2/16/24, 1:38 AM