by jmorgan on 8/28/24, 2:59 AM with 409 comments
by vessenes on 8/28/24, 3:29 AM
The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to get long range stable ‘rendering’ out of the model.
That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.
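A minimal sketch of that noise-augmentation idea (the function name and the maximum noise level are my assumptions, not taken from the paper):

```python
import numpy as np

def noise_augment_context(context_frames, rng, max_sigma=0.7):
    """Corrupt the conditioning frames with Gaussian noise of random
    magnitude, returning the noisy frames and the sampled level.

    During training the model is conditioned on (a discretized form of)
    sigma but still asked to predict the clean next frame, so at
    inference time it can 'correct' its own slightly-off previous
    outputs instead of letting errors compound over long rollouts.
    """
    sigma = rng.uniform(0.0, max_sigma)
    noisy = context_frames + sigma * rng.standard_normal(context_frames.shape)
    return noisy, sigma
```

The key point is that the noise level is exposed to the model, so it learns "how corrupted is my history" rather than treating its own drifting outputs as ground truth.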
Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.
Anyway, a fun idea that worked! Love those.
by wkcheng on 8/28/24, 3:42 AM
Abstractly, it's like the model is dreaming of a game it has played a lot, and real-time inputs just change the state of the dream. It makes me wonder if humans are just next-moment prediction machines, with just a little bit more memory built in.
by SeanAnderson on 8/28/24, 3:55 PM
It's trained on a large dataset of agents playing DOOM, and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real time in a way that amounts to "playing DOOM" at ~20 FPS.
There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.
by zzanz on 8/28/24, 3:32 AM
by godelski on 8/28/24, 8:43 AM
Doom's original system requirements:
- 4 MB RAM
- 12 MB disk space

Stable Diffusion v1: 860M-parameter UNet plus CLIP ViT-L/14 text encoder (540M)
Checkpoint size: 4.27 GB (7.7 GB with full EMA)

Running on a TPU-v5e:
- Peak compute per chip (bf16): 197 TFLOPs
- Peak compute per chip (int8): 393 TFLOPs
- HBM2 capacity and bandwidth: 16 GB, 819 GBps
- Interchip interconnect BW: 1600 Gbps
This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems the model didn't even memorize the game, despite having the capacity to do so hundreds of times over, so we definitely have lots of room for optimization methods. Though who knows how such things would transfer to existing tech, since the goal here is essentially to memorize.

What's also interesting about this work is that it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom instead (and how much prior knowledge the model gets from the pretrained components and the ViZDoom environment: was the Doom source code in T5's training data? Which ViT checkpoint was used? I can't keep track of Google's ViT checkpoints).
I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.
- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...
- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...
- https://cloud.google.com/tpu/docs/v5e
by Sohcahtoa82 on 8/28/24, 4:48 PM
Some of y'all need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.
Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.
Time spent enjoying yourself is never time wasted. Some of y'all are going to be on your death beds wishing you had allowed yourself to have more fun.
by HellDunkel on 8/28/24, 8:35 AM
by refibrillator on 8/28/24, 3:59 AM
Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.
IMO one of the biggest challenges with this approach will be open-world games with an essentially infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and cranny of DOOM. Factorio or Dwarf Fortress probably won't be simulated anytime soon… I think.
by danjl on 8/28/24, 3:34 AM
by dtagames on 8/28/24, 1:05 PM
These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).
by alkonaut on 8/28/24, 1:28 PM
I'm guessing that "This door requires a blue key" doesn't mean the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens it? THAT would be impressive. It's interesting to think that all it would take for that task to go from really hard to quite doable would be for the door requiring the blue key to be blue, and for the UI to show some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.
by helloplanets on 8/28/24, 6:05 AM
Given a sufficient separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.
To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent on what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.
I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.
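A toy sketch of that input-sequence-to-snapshot mapping, assuming a deterministic game (all names here are hypothetical):

```python
import hashlib

def input_sequence_key(inputs):
    # Collapse an ordered sequence of input events into a stable hash;
    # in a deterministic game, identical input histories imply
    # identical engine states.
    return hashlib.sha256("|".join(inputs).encode("utf-8")).hexdigest()

class StateSnapshotCache:
    """Map each input history to a saved engine-state snapshot,
    keeping game logic separate from whatever renders the graphics."""

    def __init__(self):
        self._snapshots = {}

    def record(self, inputs, state):
        self._snapshots[input_sequence_key(inputs)] = state

    def lookup(self, inputs):
        return self._snapshots.get(input_sequence_key(inputs))
```

For games that only advance on player input (Myst, Sokoban), the table only grows by one snapshot per keypress, which keeps it small.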
by panki27 on 8/28/24, 8:38 AM
I can hardly believe this claim; anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.
by golol on 8/28/24, 8:10 AM
by mo_42 on 8/28/24, 5:54 AM
I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?
by icoder on 8/28/24, 8:38 AM
(I say it can't count because there are numerous examples where the bullet count glitches; it goes right impressively often, but still, counting up or down is something computers have been able to do flawlessly basically since forever)
(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)
by lIl-IIIl on 8/28/24, 8:06 AM
Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.
From the video it seems like it is probability based - they may die right away or it might take way longer than it should.
I love how the player's health goes down when he stands in the radioactive green water.
In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.
by masterspy7 on 8/28/24, 3:49 AM
by nolist_policy on 8/28/24, 7:16 AM
by smusamashah on 8/28/24, 10:38 AM
I noticed a few hallucinations, e.g. when it picked up the green jacket from a corner, then walking back it generated another corner. I therefore don't think it has any clue about the 3D world of the game at all.
by ravetcofx on 8/28/24, 3:39 AM
by Kapura on 8/28/24, 6:45 PM
by arduinomancer on 8/28/24, 5:13 AM
Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?
by rldjbpin on 8/29/24, 8:48 AM
To me it seems like a very brute-force or greedy way to give a user the impression that they are "playing" a game. The difference being that you already own the game to make this possible, but cannot let the user use that copy!
Using generative AI for game creation is at a nascent stage, but there are much more elegant ways to go about the end goal. Perhaps in the future, with computing so far ahead that we've moved beyond the current architecture, this might be worth doing instead of emulation.
by dabochen on 8/28/24, 1:32 PM
If so, is it more like imagination/hallucination rather than rendering?
by rrnechmech on 8/28/24, 3:43 PM
I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
by jamilton on 8/28/24, 7:32 AM
Any other similar existing datasets?
A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.
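That keyboard-audio idea could start with something as simple as an energy-based onset detector over the video's audio track (a toy sketch; the frame size and threshold are guesses, and it only finds *when* keys are hit, not *which*):

```python
import numpy as np

def detect_key_press_times(audio, sr, frame=512, threshold=4.0):
    # Short-time energy onset detector: flag frames whose energy jumps
    # well above the running median as candidate key clicks, and return
    # their timestamps in seconds.
    n = len(audio) // frame
    energy = np.square(audio[:n * frame]).reshape(n, frame).sum(axis=1)
    baseline = np.median(energy) + 1e-12
    onsets = np.flatnonzero(energy > threshold * baseline)
    return onsets * frame / sr
```

Classifying *which* key was pressed would be the hard part; you'd probably need to correlate click times with on-screen movement instead.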
by throwthrowuknow on 8/28/24, 6:31 PM
1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this, perhaps information about what is present in the current level, so that you could prompt it to produce a new level from some combination.
2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.
3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?
4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.
by bufferoverflow on 8/28/24, 4:44 AM
by TheRealPomax on 8/28/24, 3:13 PM
by t1c on 8/28/24, 11:32 AM
by broast on 8/28/24, 5:16 AM
by KhoomeiK on 8/28/24, 5:30 PM
by dysoco on 8/28/24, 5:28 AM
We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?
by troupo on 8/28/24, 6:59 AM
A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.
This is not a game engine.
Creating a new good game? Good luck with that.
by throwmeaway222 on 8/28/24, 4:04 AM
I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.
by gwern on 8/28/24, 3:45 PM
by kcaj on 8/28/24, 4:36 AM
by jetrink on 8/28/24, 1:57 PM
by yair99dd on 9/1/24, 6:56 AM
by lynx23 on 8/29/24, 6:19 AM
by wantsanagent on 8/28/24, 1:34 PM
by lukol on 8/28/24, 7:29 AM
This will also allow players to easily customize what they experience without changing the core game loop.
by nuz on 8/28/24, 10:03 AM
by JDEngi on 8/28/24, 9:06 AM
by KETpXDDzR on 8/28/24, 3:06 PM
by ciroduran on 8/28/24, 10:12 AM
I was really entranced by how combat is rendered (the grunt doing weird stuff, very much in the style that the model generates images). Now I'd like to see this implemented in a shader in a game.
by seydor on 8/28/24, 11:23 AM
by golol on 8/28/24, 8:15 AM
by darrinm on 8/28/24, 4:21 AM
by holoduke on 8/28/24, 8:41 AM
by jumploops on 8/28/24, 8:33 AM
Instead of working through a game, it’s building generic UI components and using common abstractions.
by qnleigh on 8/28/24, 7:39 AM
When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.
by LtdJorge on 8/28/24, 10:31 AM
by lackoftactics on 8/28/24, 12:10 PM
by acoye on 8/28/24, 10:52 AM
by acoye on 8/28/24, 10:53 AM
by amunozo on 8/28/24, 9:32 AM
by harha_ on 8/28/24, 2:33 PM
by maxglute on 8/29/24, 4:02 AM
Wish there were 1000s of hours of Hardcore Henry to train on. Maybe scrape GoPro war cams.
by nicman23 on 8/29/24, 6:40 AM
by mobiuscog on 8/29/24, 8:44 AM
by EcommerceFlow on 8/28/24, 1:20 PM
by kqr on 8/28/24, 10:52 AM
Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.
by gwbas1c on 8/28/24, 12:43 PM
It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.
by amelius on 8/28/24, 9:05 AM
by piperswe on 8/28/24, 4:38 AM
by jasonkstevens on 8/28/24, 8:39 PM
by aghilmort on 8/28/24, 3:10 PM
by itomato on 8/28/24, 9:38 AM
by joseferben on 8/28/24, 12:16 PM
by thegabriele on 8/28/24, 8:19 AM
by YeGoblynQueenne on 8/28/24, 1:37 PM
by danielmarkbruce on 8/28/24, 6:33 PM
by richard___ on 8/28/24, 6:19 AM
by dean2432 on 8/28/24, 4:19 AM
by sitkack on 8/28/24, 5:32 AM