from Hacker News

Diffusion models are real-time game engines

by jmorgan on 8/28/24, 2:59 AM with 409 comments

  • by vessenes on 8/28/24, 3:29 AM

    So, this is surprising. Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses SD 1.4 as the core of the diffusion model, which is a nice reminder that open models are useful even to giant cloud monopolies.

    The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing Doom (makes sense), and 2) they added Gaussian noise to source frames and trained the model to ‘correct’ sequential frames back, and said this was critical to get long-range stable ‘rendering’ out of the model.

    That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.

    Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.

    Anyway, a fun idea that worked! Love those.

  • by wkcheng on 8/28/24, 3:42 AM

    It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.

    Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.
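
    Roughly the shape of that loop, as a toy sketch in Python (every name here is made up, not the paper's code): the model conditions on a sliding window of past frames and actions, denoises a fresh latent into the next frame, and feeds that frame back in as context, which is what makes it feel RNN-like.

      import torch

      CONTEXT = 32          # frames of history (an assumption; the paper uses a fixed window)
      C, H, W = 3, 64, 64   # toy frame size

      def denoise(noisy_latent, past_frames, past_actions):
          # Placeholder for the action-conditioned diffusion sampler.
          return noisy_latent * 0.0 + past_frames[-1]   # toy: echo the last frame

      frames = [torch.zeros(C, H, W)] * CONTEXT
      actions = [0] * CONTEXT
      for step in range(100):                        # ~5 seconds at 20 fps
          action = 0                                  # real-time player input would go here
          next_frame = denoise(torch.randn(C, H, W), frames, actions)
          frames = frames[1:] + [next_frame]          # slide the context window, RNN-style
          actions = actions[1:] + [action]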

  • by SeanAnderson on 8/28/24, 3:55 PM

    After some discussion in this thread, I found it worth pointing out that this paper is NOT describing a system that receives real-time user input and adjusts its output accordingly, even though, to me, the way the abstract is worded heavily implied that was occurring.

    It's trained on a large dataset of agents playing DOOM, and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real time in such a way as to be "playing DOOM" at ~20 FPS.

    There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.

  • by zzanz on 8/28/24, 3:32 AM

    The quest to run Doom on everything continues. Technically speaking, isn't this the greatest possible anti-Doom: the Doom with the highest possible hardware requirements? I just find it funny that on a linear scale of hardware specifications, Doom now finds itself on both ends.

  • by godelski on 8/28/24, 8:43 AM

    Doom system requirements:

      - 4 MB RAM
      - 12 MB disk space 
    
    Stable diffusion v1

      > 860M UNet and CLIP ViT-L/14 (540M)
      Checkpoint size:
        4.27 GB
        7.7 GB (full EMA)
      Running on a TPU-v5e
        Peak compute per chip (bf16)  197 TFLOPs
        Peak compute per chip (Int8)  393 TFLOPs
        HBM2 capacity and bandwidth  16 GB, 819 GBps
        Interchip Interconnect BW  1600 Gbps
    
    This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.

    What's also interesting about this work is that it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering the pretrained models and the ViZDoom environment? Was Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).

    I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

    - https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

    - https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

    - https://cloud.google.com/tpu/docs/v5e

    - https://github.com/Farama-Foundation/ViZDoom

    - https://zdoom.org/index

  • by Sohcahtoa82 on 8/28/24, 4:48 PM

    It's always fun reading the dead comments on a post like this. People love to point out how pointless this is.

    Some of y'all need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.

    Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.

    Time spent enjoying yourself is never time wasted. Some of y'all are going to be on your death beds wishing you had allowed yourselves to have more fun.

  • by HellDunkel on 8/28/24, 8:35 AM

    Although impressive, I must disagree. Diffusion models are not game engines. A game engine is a component that propels your game (along the time axis?). In that sense it is similar to the engine of a car, hence the name. It does not need a single working car, nor a road to drive on, to do its job. The above is a dynamic, interactive replication of what happens when you put a car on a given road, requiring a million test drives with working vehicles. An engine would also work offroad.

  • by refibrillator on 8/28/24, 3:59 AM

    There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!

    Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.

    IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.

  • by danjl on 8/28/24, 3:34 AM

    So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?
  • by dtagames on 8/28/24, 1:05 PM

    A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.

    These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

  • by alkonaut on 8/28/24, 1:28 PM

    The job of the game engine is also to render the world given only the world's properties (textures, geometries, physics rules, ...), and not given "training data that had to be supplied from an already written engine".

    I'm guessing that the "This door requires a blue key" doesn't mean the user can run around, have the engine dream up a blue key in some other corner of the map, return to the door, and have the engine open it? THAT would be impressive. It's interesting to think how little would be required for that task to go from really hard to quite doable: the door requiring the blue key would have to be blue, and the UI would have to show some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.

  • by helloplanets on 8/28/24, 6:05 AM

    So, any given sequence of inputs is rebuilt into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.

    Given sufficient separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.

    To make it less abstract, you could save some small enough snapshot of the game engine's state for every given input sequence. This could make it much less dependent on what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics in a separate step.

    I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.
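
    As a toy sketch of the idea (all names here are hypothetical): treat the game as a pure function from input history to state, cache a snapshot per input sequence, and leave the mapping from states to graphics as a separate step.

      def step(state, inp):
          # Stand-in for deterministic game logic (e.g., Sokoban-style movement).
          x, y = state
          dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[inp]
          return (x + dx, y + dy)

      snapshots = {}                           # input sequence -> engine state
      def state_for(inputs, start=(0, 0)):
          key = tuple(inputs)
          if key not in snapshots:
              state = start
              for inp in inputs:
                  state = step(state, inp)
              snapshots[key] = state
          return snapshots[key]

      print(state_for(["up", "up", "left"]))   # (-1, -2); graphics mapped separately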

  • by panki27 on 8/28/24, 8:38 AM

    > Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

    I can hardly believe this claim, anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.

  • by golol on 8/28/24, 8:10 AM

    What I understand is the following: if this works so well, why didn't we have good video generation much earlier? After diffusion models were seen to work, the most obvious thing to do was to generate the next frame based on previous frames, but... it took 1-2 years for good video models to appear. For example, compare Sora generating Minecraft video versus this method generating Minecraft video. Say in both cases the player is standing on a meadow with few inputs, watching some pigs. In the Sora video you'd expect the typical glitches to appear: erratic, sliding movement, overlapping legs, multiplication of pigs, etc. Would these glitches not appear in the GameNGen video? Why?

  • by mo_42 on 8/28/24, 5:54 AM

    An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

    I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?

  • by icoder on 8/28/24, 8:38 AM

    This is impressive. But at the same time, it can't count. We see this every time, and I understand why it happens, but it is still intriguing. We are so close, or in some ways even way beyond, and yet at the same time so extremely far away from 'our' intelligence.

    (I say it can't count because there are numerous examples where the bullet count glitches. It goes right impressively often, but still, counting, whether up or down, is something computers have been able to do flawlessly basically since forever.)

    (It is the same with chess, where LLMs are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make.)

  • by lIl-IIIl on 8/28/24, 8:06 AM

    How does it know how many times it needs to shoot the zombie before it dies?

    Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.

    From the video it seems like it is probability based - they may die right away or it might take way longer than it should.

    I love how the player's health goes down when he stands in the radioactive green water.

    In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

  • by masterspy7 on 8/28/24, 3:49 AM

    There's been a ton of work to generate assets for games using AI: 3D models, textures, code, etc. None of that may even be necessary with a generative game engine like this! If you could scale this up, train on all games in existence, etc., I bet some interesting things would happen.

  • by nolist_policy on 8/28/24, 7:16 AM

    Makes me wonder... If you stand still in front of a door so all past observations only contain that door, will the model teleport you to another level when opening the door?
  • by smusamashah on 8/28/24, 10:38 AM

    Has this model actually learned the 3d space of the game? Is it possible to break the camera free and roam around the map freely and view it from different angles?

    I noticed a few hallucinations, e.g. when it picked up a green jacket in a corner, then walking back it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.

  • by ravetcofx on 8/28/24, 3:39 AM

    There is going to be a flood of these dreamlike "games" in the next few years. This feels like a bit of a breakthrough in the engineering of these systems.

  • by Kapura on 8/28/24, 6:45 PM

    What is useful about this? I am a game programmer, and I cannot imagine a world where this improves any part of the development process. It seems to me to be a way to copy a game without literally copying the assets and code; plagiarism with extra steps. What am I missing?
  • by arduinomancer on 8/28/24, 5:13 AM

    How does the model “remember” the whole state of the world?

    Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?

  • by rldjbpin on 8/29/24, 8:48 AM

    This is truly a cool demo, but a very misleading title.

    To me it seems like a very brute-force or greedy way to give the user the impression that they are "playing" a game. The difference being that you already own the game to make this possible, but cannot let the user use that copy!

    Using generative AI for game creation is at a nascent stage, but there are much more elegant ways to go about the end goal. Perhaps in the future, with computing so far ahead that we've moved beyond the current architecture, this might be worth doing instead of emulation.

  • by dabochen on 8/28/24, 1:32 PM

    So there is no interactivity, but the generated content is not the exact view in the training data, is this the correct understanding?

    If so, is it more like imagination/hallucination rather than rendering?

  • by rrnechmech on 8/28/24, 3:43 PM

    > To mitigate auto-regressive drift during inference, we corrupt context frames by adding Gaussian noise to encoded frames during training. This allows the network to correct information sampled in previous frames, and we found it to be critical for preserving visual stability over long time periods.

    I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
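
    The drift is autoregressive error accumulation: each generated frame is slightly off, that imperfect frame gets fed back in as context for the next one, and the errors compound until the image degrades into mush within seconds. The model was only ever trained on clean ground-truth contexts, so its own slightly-wrong outputs are out-of-distribution. Corrupting the training contexts with noise closes that gap: the model learns to treat flawed context frames as noise to discount, and pulls its prediction back toward a plausible frame. A toy sketch of the augmentation (names and values are my guesses, not the paper's code):

      import torch

      def corrupt_context(context_frames, max_sigma=0.7):
          # Sample a noise level and corrupt the frames the model conditions on,
          # so self-generated context at inference time looks familiar.
          sigma = torch.rand(1) * max_sigma
          noisy = context_frames + sigma * torch.randn_like(context_frames)
          return noisy, sigma   # IIRC the paper also feeds the noise level to the model

      context = torch.randn(32, 3, 64, 64)      # e.g., 32 past frames
      noisy_context, sigma = corrupt_context(context)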

  • by jamilton on 8/28/24, 7:32 AM

    I wonder if the MineRL (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io) dataset would be sufficient to reproduce this work with Minecraft.

    Any other similar existing datasets?

    A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.
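
    A rough sketch of how the goofy route might start (purely hypothetical pipeline; "gameplay.wav" is a stand-in): onset detection on the audio track gives candidate keypress timestamps. Figuring out *which* key was pressed is the actually hard part.

      import librosa

      y, sr = librosa.load("gameplay.wav", sr=None)         # audio ripped from a video
      onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
      print([round(t, 2) for t in onsets[:10]])             # candidate keypress times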

  • by throwthrowuknow on 8/28/24, 6:31 PM

    Several thoughts for future work:

    1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this, perhaps information about what is present in the current level, so that you could prompt it to produce a new level from some combination.

    2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.

    3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?

    4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.

  • by bufferoverflow on 8/28/24, 4:44 AM

    That's probably how our reality is rendered.
  • by TheRealPomax on 8/28/24, 3:13 PM

    If by "game" you mean "literal hallucination" then yes. But if we're not trying to click-bait, then no: it's not really a game when there is no permanence or determinism to be found anywhere. It might be a "game-flavoured dream simulator", but it's absolutely not a game engine.
  • by t1c on 8/28/24, 11:32 AM

    They got DOOM running on a diffusion engine before GTA 6
  • by broast on 8/28/24, 5:16 AM

    Maybe one day this will be how operating systems work.
  • by KhoomeiK on 8/28/24, 5:30 PM

    NVIDIA did something similar with GANs in 2020 [1], except users could actually play those games (unlike in this diffusion work which just plays back simulated video). Sentdex later adapted this to play GTA with a really cool demo [2].

    [1] https://research.nvidia.com/labs/toronto-ai/gameGAN/

    [2] https://www.youtube.com/watch?v=udPY5rQVoW0

  • by dysoco on 8/28/24, 5:28 AM

    Ah, finally we are starting to see something gaming related. I'm curious why we haven't seen more neural networks applied to games, even in a completely experimental fashion; we used to have a lot of little experimental indie games such as Façade (2005), and I'm surprised we don't have something similar years after the advent of LLMs.

    We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?

  • by troupo on 8/28/24, 6:59 AM

    Key: "predicts next frame, recreates classic Doom". A game that was analyzed and documented to death. And the training included uncountable runs of Doom.

    A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.

    This is not a game engine.

    Creating a new good game? Good luck with that.

  • by throwmeaway222 on 8/28/24, 4:04 AM

    You know how when you're dreaming and you walk into a room at your house and you're suddenly naked at school?

    I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.

  • by kcaj on 8/28/24, 4:36 AM

    Take a bunch of videos of the real world and calculate the differential camera motion with optical flow or feature tracking. Call this the video’s control input. Now we can play Sora.
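
    A rough sketch of that recipe ("walk.mp4" is a stand-in): average the dense optical flow between consecutive frames and call that the per-frame camera-motion control signal. The mean flow is only a decent camera proxy when the scene is mostly static.

      import cv2

      cap = cv2.VideoCapture("walk.mp4")
      ok, prev = cap.read()
      prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
      controls = []
      while True:
          ok, frame = cap.read()
          if not ok:
              break
          gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
          flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                              0.5, 3, 15, 3, 5, 1.2, 0)
          controls.append(flow.mean(axis=(0, 1)))   # (dx, dy) control input per frame
          prev_gray = gray
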
  • by jetrink on 8/28/24, 1:57 PM

    What if instead of a video game, this was trained on video and control inputs from people operating equipment like warehouse robots? Then an automated system could visualize the result of a proposed action or series of actions when operating the equipment itself. You would need a different model/algorithm to propose control inputs, but this would offer a way for the system to validate and refine plans as part of a problem solving feedback loop.
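
    The planning loop that implies, as a toy sketch (every function here is a stub, not a real system): propose candidate action sequences, roll each one out inside the learned model, score the imagined outcomes, and execute the best plan.

      import random

      def imagine(state, actions):          # stand-in for the learned world model
          return state + sum(actions)       # toy dynamics

      def score(outcome, goal=10):          # stand-in for the task objective
          return -abs(goal - outcome)

      state = 0
      candidates = [[random.choice([-1, 0, 1]) for _ in range(5)] for _ in range(20)]
      best = max(candidates, key=lambda plan: score(imagine(state, plan)))
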
  • by yair99dd on 9/1/24, 6:56 AM

    YouTube user hu-po streams critical, in-depth examinations of AI papers. Here is his take on this paper (and other relevant ones): https://www.youtube.com/live/JZgqQB4Aekc

  • by lynx23 on 8/29/24, 6:19 AM

    Hehe, this sounds like the backstory of a remake of The Terminator, or "I Have No Mouth, and I Must Scream." In the aftermath of AI killing off humanity, researchers look deeply into how this could have happened. And after a number of dead ends, they finally realize: it was trained, in its infancy, on Doom!

  • by wantsanagent on 8/28/24, 1:34 PM

    Anyone have reliable numbers on the file sizes here? Doom.exe from my searches was around 715k, and with all assets somewhere around 10MB. It looks like the SD 1.4 files are over 2GB, so it's likely we're looking at a 200-2000x increase in file size depending on if you think of this as an 'engine' or the full game.
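
    Quick sanity check on those ratios (sizes approximate, taken from the comment; the top end comes out closer to 3000x):

      engine = 715e3             # Doom.exe, bytes
      full_game = 10e6           # with all assets
      sd14 = 2.1e9               # SD 1.4 weights, roughly
      print(sd14 / full_game)    # ~210x vs. the full game
      print(sd14 / engine)       # ~2900x vs. the bare engine
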
  • by lukol on 8/28/24, 7:29 AM

    I believe future game engines will be state machines with deterministic algorithms that can be reproduced at any time. However, rendering said state into visual / auditory / etc. experiences will be taken over by AI models.

    This will also allow players to easily customize what they experience without changing the core game loop.

  • by nuz on 8/28/24, 10:03 AM

    I wonder how overfit it is, though. You could fit a lot of Doom-resolution JPEG frames into 4 GB (the size of SD 1.4).
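
    Back-of-envelope (all numbers are assumptions): if a 320x200 Doom frame compresses to ~20 KB of JPEG, 4 GB holds ~200k frames, i.e. nearly three hours of footage at 20 fps.

      frame_bytes = 20e3               # ~20 KB per JPEG frame (assumed)
      frames = 4e9 / frame_bytes
      print(frames)                    # ~200,000 frames
      print(frames / 20 / 3600)        # ~2.8 hours at 20 fps
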
  • by JDEngi on 8/28/24, 9:06 AM

    This is going to be the future of cloud gaming, isn't it? In order to deal with the latency, we just generate the next frame locally, and we'll have the true frame coming in later from the cloud, so we're never dreaming too far ahead of the actual game.
  • by KETpXDDzR on 8/28/24, 3:06 PM

    I think the correct title should be "Diffusion Models Are Fake Real-Time Game Engines". I don't think just more training will ever be sufficient to create a complete game engine. It would need to "understand" what it's doing.
  • by ciroduran on 8/28/24, 10:12 AM

    Congrats on running Doom on a diffusion model :D

    I was really entranced by how combat is rendered (the grunt doing weird stuff, very much in the style of the model's generated images). Now I'd like to see this implemented in a shader in a game.

  • by seydor on 8/28/24, 11:23 AM

    I wonder how far it is from this to generating language reasoning about the game from the game itself, rather than from learning a large corpus of language like LLMs do. That would be a truly grounded language generator.

  • by golol on 8/28/24, 8:15 AM

    Certain categories of youtube videos can also be viewed as some sort of game where the actions are the audio/transcript advanced a couple of seconds. Add two eggs. Fetch the ball. I'm walking in the park.
  • by darrinm on 8/28/24, 4:21 AM

    So… is it interactive? Playable? Or just generating a video of gameplay?
  • by holoduke on 8/28/24, 8:41 AM

    I saw a video a while ago where they recreated actual Doom footage with a diffusion technique so it looked like a jungle or anything you liked. Can't find it anymore, but it looked impressive.

  • by jumploops on 8/28/24, 8:33 AM

    This seems similar to how we use LLMs to generate code: generate, run, fix, generate.

    Instead of working through a game, it’s building generic UI components and using common abstractions.

  • by qnleigh on 8/28/24, 7:39 AM

    Could a similar scheme be used to drastically improve the visual quality of a video game? You would train the model on gameplay rendered at low and high quality (say with and without ray tracing, and with low and high density meshing), and try to get it to convert a quick render into something photorealistic on the fly.

    When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.
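
    Off-the-shelf img2img already gives a crude per-frame version of this (a hedged sketch, not the paper's method; the model id and strength below are assumptions). The hard part, and probably why it isn't in mainstream games yet, is doing it at game framerates with temporal consistency, so consecutive frames don't flicker independently.

      import torch
      from diffusers import StableDiffusionImg2ImgPipeline
      from PIL import Image

      pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
          "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
      low_quality = Image.open("raster_frame.png").convert("RGB")   # the cheap render
      enhanced = pipe(prompt="photorealistic ray-traced render",
                      image=low_quality, strength=0.35).images[0]   # low strength keeps geometry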

  • by LtdJorge on 8/28/24, 10:31 AM

    So is it taking inputs from a player and simulating the gameplay or is it just simulating everything (effectively, a generated video)?
  • by lackoftactics on 8/28/24, 12:10 PM

    I think Alan's conservative countdown to AGI will need to be updated after this. https://lifearchitect.ai/agi/ This is really impressive stuff. I thought about this a couple of months ago, that this is probably the next modality worth exploring for data, but I didn't imagine it would come so fast. On the other hand, the amount of compute required is crazy.

  • by acoye on 8/28/24, 10:52 AM

    Nvidia's CEO reckons your GPU will be replaced with AI in "5-10 years". So this is the sort of first working game of that kind, I guess.

  • by acoye on 8/28/24, 10:53 AM

    I'd love to see John Carmack come back from his AGI hiatus and advance AI-based rendering. This would be super cool.

  • by amunozo on 8/28/24, 9:32 AM

    This is amazing and an interesting discovery. It is a pity that it doesn't seem capable of creating anything new.

  • by harha_ on 8/28/24, 2:33 PM

    This is so sick I don't know what to say. I never expected this, aren't the implications of this huge?
  • by maxglute on 8/29/24, 4:02 AM

    RL Tetris-effect hallucination.

    Wish there were thousands of hours of Hardcore Henry to train on. Maybe scrape GoPro war cams.

  • by nicman23 on 8/29/24, 6:40 AM

    What I want from something like this is a mix: a model that can infinitely "zoom" into an object's texture (even if it's not perfect, it would be fine), and a model that creates 3D geometry from bump maps / normals.

  • by mobiuscog on 8/29/24, 8:44 AM

    Video game streamers are next in line to be replaced by AI, I guess.

  • by EcommerceFlow on 8/28/24, 1:20 PM

    Jensen said this is the future of gaming a few months ago, FYI.

  • by kqr on 8/28/24, 10:52 AM

    I have been kind of "meh" about the recent AI hype, but this is seriously impressive.

    Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.

  • by gwbas1c on 8/28/24, 12:43 PM

    Am I the only one who thinks this is faked?

    It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.

  • by amelius on 8/28/24, 9:05 AM

    Yes, and you can use an LLM to simulate role playing games.
  • by piperswe on 8/28/24, 4:38 AM

    This is honestly the most impressive ML project I've seen since... probably O.G. DALL-E? Feels like a gem in a sea of AI shit.
  • by jasonkstevens on 8/28/24, 8:39 PM

    AI no longer plays Doom; it is Doom.

  • by aghilmort on 8/28/24, 3:10 PM

    Looking forward to, and/or wondering about, the overlap with the notion of ray-tracing LLMs.

  • by itomato on 8/28/24, 9:38 AM

    The gibs are a dead giveaway
  • by joseferben on 8/28/24, 12:16 PM

    Impressive. Imagine this but photorealistic, with VR goggles.

  • by thegabriele on 8/28/24, 8:19 AM

    Wow, I bet Boston Dynamics and such are quite interested
  • by YeGoblynQueenne on 8/28/24, 1:37 PM

    Misleading Titles Are Everywhere These Days.
  • by danielmarkbruce on 8/28/24, 6:33 PM

    What is the point of this? It's hard to see how this is useful. Maybe it's just an exercise to show what a diffusion model can do?
  • by richard___ on 8/28/24, 6:19 AM

    Uhhh… demos would be more convincing with enemies and decreasing health
  • by dean2432 on 8/28/24, 4:19 AM

    So in the future we can play FPS games given any setting? Pog
  • by sitkack on 8/28/24, 5:32 AM

    What most programmers don't understand is that in the very near future, the entire application will be delivered by an AI model: no source, no text, just connect to the app over RDP. The whole app will be created by example; the app developer will train the app like a dog trainer trains a dog.