Diffusion for World Modeling (diamond-wm.github.io)
Submitted by francoisfleuret 16 hours ago
  • smusamashah 13 hours ago

    This video https://x.com/Sentdex/status/1845146540555243615 looks way too much like my dreams. This is almost exactly what happens when I sometimes try to jump high: it transports me to a different place just like that. Things keep changing just like that. It's amazing to see how close it is to a real dream experience.

    • kleene_op 11 hours ago

      I noticed that all text looked garbled when I had lucid dreams. When diffusion models started to gain attention, I made the connection that text in generated images looked garbled in the same way.

      Maybe all of those are clues that parts of the human subconscious mind operate pretty close to the principles behind diffusion models.

      • qwertox 8 hours ago

        I don't think lucid dreaming is a requirement for this. Whenever I dream, my environment morphs into another one, scene by scene. Things I try to get details from, like the content of a text, refuse to resolve clearly enough to extract any meaningful information, no matter what I try.

        • smusamashah 10 hours ago

          I also lucid dream occasionally. Very rarely are things detailed; most often the colors and details are just as bleak and blurry, and keep changing, as in these videos. If I walk down a street and take a turn (or not), it's almost guaranteed I can't go back to where I came from. I appreciate it when I can trace back the same path.

          • sci_prog 11 hours ago

            Also the AI-generated images that can't get the fingers right. Have you ever tried looking at your hands while lucid dreaming and counting your fingers? There are some really interesting parallels between dreams and diffusion models.

            • dartos 10 hours ago

              Of course, due to the very nature of dreams, your awareness of diffusion models and their output flavors how you perceive even past dreams.

              Our brains love retroactively altering fuzzy memories.

              • hombre_fatal 9 hours ago

                On the other hand, psychedelics give you perceptions similar to even the early DeepDream genAI images.

                On LSD, I was swimming in my friend’s pool (for six hours…) amazed at all the patterns on his pool tiles underwater. I couldn’t get enough. Every tile had a different sophisticated pattern.

                The next day I went back to his place (sober) and commented on how cool his pool tiles were. He had nfi what I was talking about.

                I walk out to the pool and sure enough it’s just a grid of small featureless white tiles. Upon closer inspection they have a slight grain to them. I guess my brain was connecting the dots on the grain and creating patterns.

                It was quite a trip to be so wrong about reality.

                Not really related to your claim I guess but I haven’t thought of this story in 10 years and don’t want to delete it.

                • Jackson__ 8 hours ago

                  This may be a joke, but counting your fingers to lucid dream has been a thing for a lot longer than diffusion models.

                  That being said, your reality will influence your dreams if you're exposed to some things enough. I used to play Minecraft on a really bad PC back in the day, and in my lucid dreams I would encounter the same slow chunk loading I saw in the game.

            • siavosh 6 hours ago

              What’s amazing is that if you really start paying attention, it seems like the mind is often doing the same thing when you’re awake: less noticeably in your visual field, more noticeably in attention and thoughts themselves.

              • smusamashah 5 hours ago

                This is a very interesting thought. I never thought of the mind doing anything like that in the waking state. I know I will now be thinking about this idea every time I recall those dreams.

                • TaylorAlexander 4 hours ago

                  Yeah, I also hadn’t, but it makes sense. Just like all output from an LLM is a “hallucination” but we have a tendency to only call it a hallucination when something looks wrong about the result, we forget that our conscious lived experience is a hallucination based on a bunch of abstract sensory data that our brain fuses into a world-state experience. It’s obvious when we are asleep that dreams are hallucinations, but it is less obvious that the conscious experience is too.

              • jvanderbot 7 hours ago

                This is why I'm excited in a limited way. Clearly something is disconnected in a dream state that has an analogous disconnect here.

                I think these models lack a world model, something with strong spatial reasoning and continuity expectations that animals have.

                Of course that's probably learned too.

                • earnesti 10 hours ago

                  That looks way too much like the one time I did DMT-5.

                  • TechDebtDevin 9 hours ago

                    Machine Elves

                    • loxias 4 minutes ago

                      IYKYK

                  • thegabriele 10 hours ago

                    We are unconsciously (pun intended) implementing how brains work in both dream and wake states. Can't wait until we add some kind of (lossless) memory to these models.

                    • hackernewds 10 hours ago

                      Any evidence to back this lofty claim?

                      • sweeter 6 hours ago

                        vibes

                      • soraki_soladead 10 hours ago

                        We have lossless memory for models today. That's the training data. You could consider this the offline version of a replay buffer which is also typically lossless.

                        The online, continuous and lossy version of this problem is more like how our memory works and still largely unsolved.
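
                        A rough sketch of the distinction, with illustrative names (the buffers and sizes here are made up, not from the paper):

                            import random
                            from collections import deque

                            # Offline and lossless: the full dataset is kept and can be replayed exactly.
                            offline_buffer = []

                            # Online and lossy: a bounded buffer that silently drops the oldest
                            # experience, closer to how biological memory behaves.
                            online_buffer = deque(maxlen=10_000)

                            def store(obs, action):
                                offline_buffer.append((obs, action))  # grows without bound
                                online_buffer.append((obs, action))   # oldest entries fall off the end

                            def sample_batch(buffer, batch_size=32):
                                return random.sample(list(buffer), min(batch_size, len(buffer)))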

                    • francoisfleuret 15 hours ago

                      This is a 300M-parameter model (1/1300th the size of the big Llama 3) trained on 5M frames over 12 days on an RTX 4090.

                      This is what a big tech company was doing in 2015.

                      The same stuff at industrial scale à la large LLMs would be absolutely mind blowing.

                      • gjulianm 12 hours ago

                        What exactly would be the benefit of that? We already have Counter-Strike running far more smoothly than this, without wasting tons of compute.

                        • ben_w 12 hours ago

                          As with diffusion models in general, the point isn't the specific example but that it's generalisable.

                          5 million frames of video data with corresponding accelerometer data, and you'd get this but with genuine photorealism.

                          • gjulianm 8 hours ago

                            Generalisable how? The model completely hallucinates on invalid input, it's not even high quality, and it required CS:GO to work. What's the output you expect from this, and what alternatives are there?

                            • Art9681 2 hours ago

                              None of those questions are relevant, are they? I get the impression you've already decided this isn't good enough, which is basically agreeing with everyone else. No one is talking about what it's capable of today. Read the thread again. We're imagining the strong probability that a few iterations from now this thing will basically be The Matrix.

                              • ben_w 6 hours ago

                                It did not require CSGO, that was simply one of their examples. The very first video in the link shows a bunch of classic Atari games, and even the video which is showing CSGO is captioned "DIAMOND's diffusion world model can also be trained to simulate 3D environments, such as CounterStrike: Global Offensive (CSGO)" — I draw your attention to "such as" being used rather than "only".

                                And I thought I was fairly explicit about video data, but just in case that's ambiguous: the stuff you record with your phone camera set to video mode, synchronised with the accelerometer data instead of player keyboard inputs.

                                As for output, with the model as it currently stands, I'd expect training on 24 hours of 60fps video to yield something "photorealistic and with similar weird hallucinations". Which is still interesting, even without combining this with a ControlNet the way Stable Diffusion can.

                                • empath75 4 hours ago

                                  You do the same thing at a larger scale, and instead of video game footage you use a few million hours of remote controlled drone input in the real world.

                              • stale2002 8 hours ago

                                To answer your question directly, the benefit is that we could make something different from counter strike.

                                You see, there are these things called "proofs of concept" that are meant not to be a product, but to show off capabilities.

                                Counter-Strike is an example, meant to show off complex capabilities. It is not meant to suggest that the useful thing about these models is literally recreating Counter-Strike.

                                • gjulianm 8 hours ago

                                  Which capabilities are being shown off here? The ability to take an already existing world model and spend lots of compute to get a worse, less correct model?

                                  • stale2002 6 hours ago

                                    The capability to have mostly working, real-time generation of images that represent a world model.

                                    If that capability is possible, then it could be possible to take 100 examples of separate world models that exist, and then combine those world models together in interesting ways.

                                    Combining world models is an obvious next step (i.e., not shown off in this proof of concept, but a logical/plausible future capability).

                                    Having multiple world models combined in new and interesting ways is almost like creating an entirely new world model, even though that's not exactly the same.

                                • nuz 11 hours ago

                                  "What would be the point of creating a shooter set in the middle east? We already have pong and donkey kong"

                                  • eproxus 12 hours ago

                                    But please, think of the shareholders!

                                  • GaggiX 15 hours ago

                                      If 12 days with an RTX 4090 is all you need, some random people on the Internet will soon start training their own.

                                    • cs702 8 hours ago

                                      Came here to say pretty much the same thing, and saw your comment.

                                      The rate of progress has been mind-blowing indeed.

                                      We sure live in interesting times!

                                      • Sardtok 13 hours ago

                                        Two 4090s, but yeah.

                                        • Sardtok 13 hours ago

                                            Never mind, the repo on GitHub says 12 days on a 4090, so I'm unsure why the title here says two.

                                      • marcyb5st 13 hours ago

                                        So, this is pretty exciting.

                                          I can see how this could already be used to generate realistic physics approximations in a game engine. You create a bunch of snippets of gameplay using a much heavier, more realistic physics engine (perhaps even CGI). The model learns to approximate the physics and boom, now you have a lightweight physics engine. Perhaps you can even have several that are specialized (e.g. one for smoke dynamics, one for explosions, ...). Even if it hallucinates, it wouldn't be worse than the physics bugs that are so common in games.

                                        • monsieurbanana 13 hours ago

                                            > Even if it hallucinates, it wouldn't be worse than the physics bugs that are so common in games.

                                            I don't know about that. Physics bugs are common, but you can prioritize and fix the worst (game-breaking) ones. If you have a black-box model, it becomes much harder to do that.

                                          • bobsomers 9 hours ago

                                              What makes you think the network inference is less expensive? Newtonian physics is already extremely well understood and pretty cheap to compute.

                                            How would a "function approximation" of Newtonian physics, with billions of parameters, be cheaper to compute?

                                            It seems like this would both be more expensive and less correct than a proper physics simulation.

                                            • crackalamoo 34 minutes ago

                                                Basic Newtonian physics is pretty efficient to compute, but afaik some more complex physics, like fluid dynamics, is faster with network inference. There are probably a lot of cases where network-inference physics is faster.
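
                                                As a toy illustration of the trade-off (a purely hypothetical surrogate, not anything from the paper): a learned model replaces an iterative solver with a fixed number of matrix multiplies, so its cost per step is constant no matter how much accuracy a classical solver would have to iterate for.

                                                    import torch
                                                    import torch.nn as nn

                                                    # Hypothetical surrogate: one fixed-cost forward pass stands in for
                                                    # a solver step that might otherwise need many iterations
                                                    # (e.g. a pressure solve in a fluid simulation).
                                                    class FluidSurrogate(nn.Module):
                                                        def __init__(self, grid_cells: int, hidden: int = 512):
                                                            super().__init__()
                                                            self.net = nn.Sequential(
                                                                nn.Linear(grid_cells, hidden), nn.ReLU(),
                                                                nn.Linear(hidden, hidden), nn.ReLU(),
                                                                nn.Linear(hidden, grid_cells),  # predicted next velocity field
                                                            )

                                                        def forward(self, velocity_field: torch.Tensor) -> torch.Tensor:
                                                            return self.net(velocity_field)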

                                            • twic 12 hours ago

                                              Do you think that inference on a thirteen million parameter neural network is more lightweight than running a conventional physics engine?

                                              • procgen 10 hours ago

                                                Convincing liquid physics (e.g. surf interacting with a beach, rocks, the player character) might be a good candidate.

                                                • tiagod 9 hours ago

                                                  In some cases, the model will be lighter. There is no need for 14M parameters for physics simulations, and there's a lot of promising work in that area.

                                                  • epolanski 12 hours ago

                                                    Every software that can be implemented in a JavaScript, ehm, LLM, will eventually be implemented in an LLM.

                                                    • kendalf89 10 hours ago

                                                      Are you predicting node.llm right now?

                                                  • slashdave 32 minutes ago

                                                    Define "lightweight".

                                                    • crazygringo 8 hours ago

                                                      Yeah, I definitely wouldn't trust it to replace basic physics of running, jumping, bullets, objects shattering, etc.

                                                          But it seems extremely promising for fiery explosions, smoke, and especially water: anything with inherently complex dynamics.

                                                      Also for lighting -- both to get things like skin right with subsurface scattering, as well as global ray-traced lighting.

                                                          You can train specific lightweight models for these things, and the important thing is that their output is correct at the macro level. E.g., a tree should cast a shadow that looks like the right shadow at the right angle for that type of tree and its kinds of leaves and general shape. Nobody cares whether each individual leaf shadow corresponds to an individual leaf 10 feet above or is just hallucinated.

                                                      • Thorrez 12 hours ago

                                                        Would that work for multiplayer? If it's a visual effect only, I guess it would be ok. But if it affects gameplay, wouldn't different players get different results?

                                                        • killerstorm 12 hours ago

                                                            Well, it doesn't make sense to use this exact model - this is just a demonstration that it can learn a world model from pixels.

                                                            An obvious next step towards a more playable game is to add a state vector to the inputs of the model: it is easier to learn to render the world from pixels + state vectors than from pixels alone.

                                                            Then it depends what we want to do. If we want normal Counter-Strike gameplay but with new graphics, we can keep the existing CS game server and train only the rendering part.

                                                            If you want to make Dream-Counter-Strike, where the rules are more bendable, then you might want to train a state-update model as well...
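
                                                            A minimal sketch of that conditioning, with made-up shapes (nothing here is from the DIAMOND code):

                                                                import torch
                                                                import torch.nn as nn

                                                                # Hypothetical sketch: condition the frame denoiser on an explicit
                                                                # game-state vector (positions, health, round timer, ...) in
                                                                # addition to the past pixels.
                                                                class ConditionedDenoiser(nn.Module):
                                                                    def __init__(self, frame_channels=3, context_frames=4, state_dim=64):
                                                                        super().__init__()
                                                                        in_ch = frame_channels * (context_frames + 1)  # past frames + noisy frame
                                                                        self.backbone = nn.Conv2d(in_ch, frame_channels, kernel_size=3, padding=1)
                                                                        self.state_proj = nn.Linear(state_dim, frame_channels)

                                                                    def forward(self, noisy_frame, past_frames, state_vec):
                                                                        x = torch.cat([noisy_frame, past_frames], dim=1)
                                                                        out = self.backbone(x)
                                                                        # inject the state vector as a per-channel bias over all pixels
                                                                        return out + self.state_proj(state_vec)[:, :, None, None]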

                                                        • amelius 10 hours ago

                                                          > boom, now you have a lightweight physics engine

                                                          lightweight, but producing several hundred watts of heat.

                                                          • hobs 13 hours ago

                                                            A physics bug would be a consistent problem you can fix. There's no such guarantee about an ML model. This would likely only be ok in the context of a game specifically made to be janky.

                                                            • fullstackwife 13 hours ago

                                                                This is one of the fallacies of the current AI research space: they don't focus much on the end-user. In this case the end-user would be the gamer, and while playing games you expect valid gameplay, so these kinds of hallucinations are not acceptable, while I'm pretty sure they give the research authors a strong dopamine hit. We have a hammer and now we are looking for a nail, when we should ask a question first: what is the problem we are trying to solve here?

                                                              Real world usage will be probably different, and maybe even unexpected by the authors of this research.

                                                              • jsheard 13 hours ago

                                                                > This is one of the fallacies of current AI research space: they don't focus on the end-user too much. In this case the end-user would be the gamer

                                                                Or from another angle the end-user is a game developer trying to actually work with this kind of technology, which is just a nightmarish prospect. Nobody in the industry is asking for a game engine that runs entirely on vibes and dream logic, gamedev is already chaotic enough when everything is laid out in explicit code and data.

                                                                • stale2002 8 hours ago

                                                                  > they don't focus on the end-user too much.

                                                                  Of course they don't. Stuff like this is a proof of concept.

                                                                  If they had a product that worked, they wouldn't be in academia. Instead, they would leave the world of research and create a multi billion dollar company.

                                                                  Almost by definition, anything in academia isn't going to be productized, because if it was, then the researchers would just stop researching and make a bunch of money selling the product to consumers.

                                                                  Such research is still useful for society, though, as it means that someone else can spend the millions and millions of dollars making a better version and then selling that.

                                                                  • badpun 12 hours ago

                                                                      The whole purpose of academia is literally to nerd out on cool, impractical things, which will occasionally turn out to have some real-life relevance years or decades later. This (hallucinated CS) is still more relevant to the real world than 99% of what happens in academic research.

                                                                    • dartos 10 hours ago

                                                                      Yes to the first part, no to the random “99% useless” number you made up.

                                                                      I’m no fan of academia, but it undeniably produces useful and meaningful knowledge regularly.

                                                                  • kqr 13 hours ago

                                                                    This obsession people have with determinism! I'd much rather take a low rate of weird bugs than common consistent ones. I don't believe reproducibility of bugs makes for better gameplay generally.

                                                                    • paulryanrogers 12 hours ago

                                                                      Reproducibility does make bugs more likely to be fixed, or at least fixable.

                                                                      Also, games introduce randomness in a controlled way so users don't get frustrated by it appearing in unexpected places. I don't want characters to randomly appear and disappear. It's fine if bullet trajectory varies more randomly as they get further away.

                                                                      • skydhash 11 hours ago

                                                                        Also most engines have been worked on for years. So more often than not, core elements like audio, physics, input,... are very stable and the remaining bugs are either "can't fix" or "won't fix".

                                                                      • NotMichaelBay 12 hours ago

                                                                        It might be fine for casual players, but it would prevent serious and pro players from getting into the game. In Counter-Strike, for example, pro players (and other serious players) practice specific grenade throws so they can use them reliably in matches.

                                                                        • kqr 12 hours ago

                                                                          I'm not saying one can make specifically Counter-Strike on a non-deterministic engine -- that seems like strawmanning my argument.

                                                                          People play and enjoy many games with varying levels of randomness as a fundamental component, some even professionally (poker, stock market). This could be made such a game.

                                                                          • monsieurbanana 11 hours ago

                                                                              Either the physics engine matters, in which case you want a deterministic engine as you said, or it doesn't, like in a poker game, and you don't want to spend many resources (manpower, compute cycles) on it.

                                                                              Which also means an off-the-shelf deterministic engine.

                                                                        • mrob 12 hours ago

                                                                          The whole hobby of speedrunning relies heavily on exploiting deterministic game bugs.

                                                                          • dartos 10 hours ago

                                                                            You don’t play a lot of games, huh?

                                                                            Consistent bugs you can anticipate and play/work around, random ones you can’t. Just look at pretty much any speed running community for games before 1995.

                                                                            Say goodbye to any real competitive scene with random unfixable potentially one off bugs.

                                                                            • hobs 10 hours ago

                                                                              Make a fun game with this as a premise and I will try it, but it sounds just an annoying concept.

                                                                        • croo 15 hours ago

                                                                          For anyone who actually tried it:

                                                                          Does it respect/build some kind of game map in the process, or is it just a bizarre psychedelic dream-walk experience where you cannot go back to the same place twice and spatial dimensions are just funny? Is the game map finite?

                                                                          • InsideOutSanta 14 hours ago

                                                                            Just looking at the first video, there's a section where structures just suddenly appear in front of the player, so this does not appear to build any kind of map, or have any kind of meaningful awareness of something resembling a game state.

                                                                            This is similar to LLM-based RPGs I've played, where you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it.

                                                                            • anal_reactor 14 hours ago

                                                                              > you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it

                                                                              Mondays

                                                                            • aidos 15 hours ago

                                                                              Just skimmed the article but my guess is that it’s a dream type experience where if you turned around 180 and walked the other direction it wouldn’t correspond to where you just came from. More like an infinite map.

                                                                              • lopuhin 13 hours ago

                                                                                  I don't think so; what they show in the CS video is exactly the Dust 2 map, not just something similar/inspired by it.

                                                                                • twic 12 hours ago

                                                                                  It's trained on moving around dust2, so as long as the previous frame was a view of dust2, the next frame is very likely to be a plausible subsequent view of dust2. In some sense, this encodes a map; but it's not what most people think of when they think about maps.

                                                                                  I'd be interested to see what happens if you look down at your feet for a while, then back up. If the ground looks the same everywhere, do you come up in a random place?

                                                                                  • arendtio 11 hours ago

                                                                                    It probably depends on what you see. As long as you have a broad view over a part of the map, you should stay in that region, but I guess that if you look at a mono-color wall, you probably find yourself in a very different part of the map when you look around yourself again.

                                                                                    But I am just guessing, and I haven't tried it yet.

                                                                                • delusional 15 hours ago

                                                                                      Just tried it out, and no, it doesn't have any sort of "map" awareness. It's very much in the "recall/replay" category of "AI", where it seems to accurately recall stuff that is part of the training dataset, but as soon as you do something not in there (like walk into a wall), it completely freaks out and spits out gibberish. Plausible gibberish, but gibberish nonetheless.

                                                                                  • neongreen 15 hours ago

                                                                                    Can you upload a screen recording? I don’t think I can run the model locally but it’d be super interesting to see what happens if you run into a wall

                                                                                    • kqr 13 hours ago

                                                                                        This should mainly be a matter of giving it more training though, right? It sounds like the amount of training it's gotten is relatively sparse.

                                                                                      • treyd 12 hours ago

                                                                                        It doesn't have any ability to reason about what you did more than a couple of seconds ago. Its memory is what's currently on the screen and what the user's last few inputs were.
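
                                                                                          A rough sketch of that rollout loop (the names and the window size are illustrative, not from the paper):

                                                                                              from collections import deque
                                                                                              from typing import Callable, Deque, List

                                                                                              # Hypothetical rollout: the model's only "memory" is a sliding
                                                                                              # window of recent frames and actions, so anything that scrolled
                                                                                              # out of the window can no longer influence the next frame.
                                                                                              def rollout(model: Callable[[List, List], object],
                                                                                                          get_action: Callable[[], object],
                                                                                                          initial_frame, steps: int, context: int = 4):
                                                                                                  frames: Deque = deque([initial_frame] * context, maxlen=context)
                                                                                                  actions: Deque = deque([None] * context, maxlen=context)
                                                                                                  outputs = []
                                                                                                  for _ in range(steps):
                                                                                                      actions.append(get_action())
                                                                                                      next_frame = model(list(frames), list(actions))  # one denoised frame
                                                                                                      frames.append(next_frame)  # oldest frame falls out of the window
                                                                                                      outputs.append(next_frame)
                                                                                                  return outputs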

                                                                                        • delusional 8 hours ago

                                                                                            Theoretically. In practice, that's not clear. As you add more training data, you have to ask yourself what the point is: we already have a pretty good simulation of Counter-Strike.

                                                                                    • jmchambers 14 hours ago

                                                                                          I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level. Is there any research attempting to do this at the 3D asset level, i.e., subbing in game engine assets (with position and orientation) until a plausible scene is recreated? If it were possible to do it that way, couldn't it "dream" up real maps, with real physics, and so avoid the somewhat noisy output these types of demos generate?

                                                                                      • desdenova 14 hours ago

                                                                                        I think the closest we have right now is 3D gaussian splatting.

                                                                                        So far it's only been used to train a scene from photographs from multiple angles and rebuild it volumetrically by adjusting densities in a point-cloud.

                                                                                        But it might be possible to train a model on multiple different scenes, and perform diffusion on a random point cloud to generate new scenes.

                                                                                        Rendering a point cloud in real time is also very efficient, so it could be used to create insanely realistic game worlds instead of polygonal geometry.

                                                                                        It seems someone already thought of that: https://ar5iv.labs.arxiv.org/html/2311.11221

                                                                                        • jmchambers 14 hours ago

                                                                                          Interesting, I guess that takes things even further and removes the need for hand-crafted 3D assets altogether, which is probably how things will end up going in gaming, long-term.

                                                                                          I was suggesting a more modest approach, I guess, one where the reverse-denoising process involves picking and placing existing 3D assets, e.g., those in GTA 5, so that the process is actually building a plausible map, using those 3D assets, but on the fly...

                                                                                          Turn your car right and a plausible street decorated with buildings, trees and people is dreamt up by the algorithm. All the lighting and physics would still be done in-engine, with stable diffusion acting as a dynamic map creator, with an inherent knowledge of how to decorate a street with a plausible mix of assets.

                                                                                          I suppose it could form the basis of a procedurally generated game world where, given the same random seed, it could generate whole cities or landscapes that would be the same on each player's machine. Just an idea...

                                                                                          • skydhash 10 hours ago

                                                                                                The thing is, there are generators that can do exactly this; no need to have an LLM as the middleman. Things like terrain generation, city generation, crowd control, and character generation can be done quite easily with far less compute and energy.

                                                                                          • magicalhippo 11 hours ago

                                                                                                Technically I guess one could do a stable diffusion-like model except on voxels, where instead of pixel intensity values it produces a scalar field, which you could turn into geometry using marching cubes or something similar.

                                                                                                Not sure how efficient that would be though, and it would only work for assets like teapots and whatnot, not whole game maps, say.
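
                                                                                                A minimal sketch of the second half of that idea, assuming a diffusion model had already denoised a random volume into a scalar occupancy field (the field below is just random data as a stand-in):

                                                                                                    import numpy as np
                                                                                                    from skimage import measure  # scikit-image

                                                                                                    # Stand-in for the model's output: a denoised scalar
                                                                                                    # occupancy field (~0 outside the shape, ~1 inside).
                                                                                                    field = np.random.rand(64, 64, 64)

                                                                                                    # Marching cubes extracts the isosurface at a chosen threshold,
                                                                                                    # turning the scalar field into renderable triangle geometry.
                                                                                                    verts, faces, normals, values = measure.marching_cubes(field, level=0.5)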

                                                                                            • desdenova an hour ago

                                                                                                  That's a simplified version of what a point cloud stores, but then it only works with cubes.

                                                                                              A point cloud is basically a 3D texture of colors and densities, so a raymarching algorithm can traverse it adding densities it collides with to find the final fragment color. That's how realistic fog and clouds are rendered in games nowadays, and it's very fast, except they use a noise function instead of a scene model.

                                                                                          • slashdave 30 minutes ago

                                                                                                Stable Diffusion works in latent space, not pixel by pixel.

                                                                                            • furyofantares 6 hours ago

                                                                                              > but, as far as I know, this is always done at the pixel level

                                                                                              Image models are NOT denoised at the pixel level - diffusion happens in latent space. This was one of the big breakthroughs that made all of this work well.

                                                                                              There's a model for encoding/decoding between pixels and latent space. Latent space is able to encode whatever concepts it needs in whichever of its dimensions it needs, and is generally lower dimensional than pixel space. So we get a noisy latent space, denoise it using the diffusion model, then use the other model (variational autoencoder) to decode into pixel space.
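
                                                                                                  A minimal sketch of that pipeline (shapes loosely follow Stable Diffusion's 8x spatial downsampling; the function arguments are stand-ins for real models):

                                                                                                      import torch

                                                                                                      # encode -> denoise in latent space -> decode.
                                                                                                      # For generation from scratch we start directly from
                                                                                                      # latent noise; the VAE encoder is only needed to go
                                                                                                      # the other way (pixels -> latents, e.g. for img2img).
                                                                                                      def generate(denoiser, vae_decode, steps: int = 50):
                                                                                                          z = torch.randn(1, 4, 64, 64)  # noise in latent space, not pixels
                                                                                                          for t in reversed(range(steps)):
                                                                                                              z = denoiser(z, t)         # one reverse-diffusion step
                                                                                                          return vae_decode(z)           # latent -> (1, 3, 512, 512) image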

                                                                                              • jampekka 13 hours ago

                                                                                                    Not exactly 3D assets, but diffusion models are used to generate e.g. traffic (vehicle trajectories) for evaluating autonomous-vehicle algorithms. These vehicles tend to crash quite a lot.

                                                                                                For example https://github.com/NVlabs/CTG

                                                                                                Edit: fixed link

                                                                                                • tiborsaas 13 hours ago

                                                                                                      Generating this at the pixel level is the next-level thing. The reverse-engineering method you described is probably appealing because it's easier to understand.

                                                                                                      Focusing on pixel-level generation is the right approach, I think. The somewhat noisy output will probably be improved upon in a short timeframe. Now that they've proved with Doom (https://gamengen.github.io/) and this that it's possible, more research is probably happening to nail the correct architecture to scale this to HD with minimal hallucination. It happened with video already, so we should see a similar breakthrough soon.

                                                                                                  • gliptic 14 hours ago

                                                                                                    > I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level.

                                                                                                      It's typically not done at the pixel level, but at the "latent space" level of e.g. a VAE. The image generation is done in this space, which has fewer dimensions than the pixels of the final image, and the result is then converted to pixels using the VAE.

                                                                                                    • jmchambers 13 hours ago

                                                                                                      Frantically Googles VAE...

                                                                                                        Ah, okay, so the work is done at a different level of abstraction; I didn't know that. But I guess it's still a pixel-related abstraction, and it is converted back to pixels to generate the final image?

                                                                                                        I suppose in my proposed (and probably implausible) algorithm, that different level of abstraction might be loosely analogous to collections of related game engine assets that are often used together, so that the denoising algorithm might effectively be saying things like "we'll put some building-related assets here-ish, and some park-related flora assets over here...", and then that gets crystallised into actual placement of individual assets in the post-processing step.

                                                                                                      • StevenWaterman 12 hours ago

                                                                                                        (High level, specifics are definitely wrong here)

                                                                                                        The VAE isn't really pixel-level, it's semantic-level. The most significant bits in the encoding are like "how light or dark is the image" and then towards the other end bits represent more niche things like "if it's an image of a person, make them wear glasses". This is way more efficient than using raw pixels because it's so heavily compressed, there's less data. This was one of the big breakthroughs of stable diffusion compared to previous efforts like disco diffusion that work on the pixel level.

                                                                                                        The VAE encodes and decodes images automatically. It's not something that's written, it's trained to understand the semantics of the images in the same way other neural nets are.

                                                                                                  • cousin_it 15 hours ago

                                                                                                    I continue to be puzzled by people who don't notice the "noise of hell" in NN pictures and videos. To me it's always recognizable and terrifying, has been from the start.

                                                                                                    • npteljes 14 hours ago

                                                                                                      What do you mean by noise of hell in particular? I do notice that the images are almost always uncanny in a way, but maybe we're not meaning the same thing. Could you elaborate on what you experience?

                                                                                                      • taneq 14 hours ago

                                                                                                        Like a subtle but unsettling babble/hubbub/cacophony? If so then I think I kind of know what you mean.

                                                                                                        • TechDebtDevin 9 hours ago

                                                                                                            There's definitely a bit of an uncanny valley in the land of top-tier diffusion models. A generative video of someone smiling is way more likely to elicit this response for me than a generative image or single frame. It definitely has something to do with the movement.

                                                                                                          • cousin_it 8 hours ago

                                                                                                            Yes, that's exactly it.

                                                                                                          • HKH2 15 hours ago

                                                                                                            Eyes have a lot of noise too.

                                                                                                          • mk_stjames 14 hours ago

                                                                                                                This was Schmidhuber's group in 2018:

                                                                                                            https://worldmodels.github.io/

                                                                                                            Just want to point that out.

                                                                                                            • hervature 3 hours ago

                                                                                                              I assume you are pointing this out because it is the first reference in the paper and getting the recognition it deserves and you are simply providing this link for convenience to those who do not go to the references.

                                                                                                              • mk_stjames an hour ago

                                                                                                                Yes, it was very nice to see it was the first citation in the paper (and cited several times throughout).

                                                                                                                    The World Models paper is still one of the most amazing papers I've ever read. And I just really keep wanting to show that, in case people don't see it: many in-the-know... knew.

                                                                                                              • afh1 12 hours ago

                                                                                                                Ahead of its time for sure. Dream is an accurate term here, that driving scene does resemble driving in dreams.

                                                                                                              • DrSiemer 15 hours ago

                                                                                                                Where it gets really interesting is if we can train a model on the latest GTA, plus maybe related real life footage, and then use it to live upgrade the visuals of an old game like Vice City.

                                                                                                                The lack of temporal consistency will still make it feel pretty dreamlike, but it won't matter that much, because the base is consistent and it will look amazing.

                                                                                                                • InsideOutSanta 14 hours ago

                                                                                                                  Just redrawing images drawn by an existing game engine works, and generates amazing results, although like you point out, temporal consistency is not great. It might interpret the low-res green pixels on a far-away mountain as fruit trees in one frame, and as pines in the next.

                                                                                                                  Here's a demo from 2021 doing something like that: https://www.youtube.com/watch?v=3rYosbwXm1w

                                                                                                                  • davedx 15 hours ago

                                                                                                                    A game like GTA has way too much functionality and complex branching for this to work I think (beyond eg doing aimless drives around the city — which would be very cool though)

                                                                                                                    • DrSiemer 8 hours ago

                                                                                                                        GTA 5 has everything Vice City has and more. In the Doom AI dream it's possible to shoot people. Maybe in this CS model as well?

                                                                                                                        I think the model does not have to know anything about the functionality. It can just dream up what is most probable to happen based on the training data.

                                                                                                                    • sorenjan 13 hours ago

                                                                                                                      In addition to the sibling comment's older example there's new work done with GTA too.

                                                                                                                      https://www.reddit.com/r/aivideo/comments/1fx6zdr/gta_iv_wit...

                                                                                                                      • DrSiemer 8 hours ago

                                                                                                                        Cool! Looks fairly consistent as well.

                                                                                                                        I wonder if this type of AI upscaling could eventually also fix things like slightly janky animations, but I guess that would be pretty hard without predetermined input and some form of look ahead.

                                                                                                                        Limiting character motion to only allow correct, natural movement would introduce a strange kind of input lag.

                                                                                                                    • skydhash 10 hours ago

                                                                                                                      Why not just create the assets at a higher resolution?

                                                                                                                      • DrSiemer 8 hours ago

                                                                                                                        Because that is a lot more work, will only work for a single game, potentially requires more resources to run and will not get you the same level of realism.

                                                                                                                      • empath75 4 hours ago

                                                                                                                        People focusing on the use of this in video games baffles me. The point isn't that it can regenerate a videogame world, the point is that it can simulate the _real world_. They're using video game footage to train it because it's cheap and easy to synthesize the data they need. This system doesn't know it's simulating a game. You can give it thousands or millions of hours of real world footage and agent input and get a simulation of the real world.

                                                                                                                        • taneq 5 hours ago

                                                                                                                          Using it as a visual upgrade is pretty close to what DLSS does so that sounds plausible.

                                                                                                                        • ilaksh 11 hours ago

                                                                                                                          I wonder if there is some way to combine this with a language model, or somehow have the language model in the same latent space or something.

                                                                                                                        Is that what vision-language models already do? Somehow all of the language should be grounded in the world model. For models like Gemini that can answer questions about video, it must have some level of this grounding already.

                                                                                                                          I don't understand how this stuff works, but compressing everything to one dimension as in a language model for processing seems inefficient. The reason our language is serial is because we can only make one sound at a time.

                                                                                                                          But suppose the "game" trained on was a structural engineering tool. The user asks about some scenario for a structure and somehow that language is converted to an input visualization of the "game state". Maybe some constraints to be solved for are encoded also somehow as part of that initial state.

                                                                                                                          Then when it's solved (by an agent trained through reinforcement learning that uses each dreamed game state as input?), the result "game state" is converted somehow back into language and combined with the original user query to provide an answer.

                                                                                                                          But if I understand properly, the biggest utility of this is that there is a network that understands how the world works, and that part of the network can be utilized for predicting useful actions or maybe answering questions etc. ?

                                                                                                                          • LarsDu88 10 hours ago

                                                                                                                          To combine with a language model, simply replace the action vector with a language-model latent.

                                                                                                                          Alternatively, as of last year there are now purely diffusion-based text decoder models.
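
                                                                                                                          A minimal sketch of that swap (the embedding table is a stand-in for a real language-model encoder):

                                                                                                                              import torch
                                                                                                                              import torch.nn as nn

                                                                                                                              # Hypothetical: condition the world model on a text latent
                                                                                                                              # instead of an action vector.
                                                                                                                              text_encoder = nn.Embedding(30_000, 512)

                                                                                                                              def text_latent(token_ids: torch.Tensor) -> torch.Tensor:
                                                                                                                                  # mean-pool token embeddings into one vector, shaped like
                                                                                                                                  # the action conditioning the denoiser already expects
                                                                                                                                  return text_encoder(token_ids).mean(dim=1)  # (batch, 512)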

                                                                                                                            • empath75 4 hours ago

                                                                                                                            Not everything needs to be a single giant neural network. You could have a bunch of weakly coupled specialized networks sending data back and forth over a normal API.

                                                                                                                            • mungoman2 15 hours ago

                                                                                                                              This is getting ridiculous!

                                                                                                                              Curious: since this is a tight loop of old frame + input -> new frame, what happens if a non-CS image is used to start it off? Or a map the model has never seen? Will the model play ball, or will it drift back to known CS maps?

                                                                                                                              • Arch-TK 15 hours ago

                                                                                                                                Looks like it only knows Dust 2, since every single "dream" (I'm going to call them that, since looking at this stuff feels like dreaming about Dust 2) is of that map only.

                                                                                                                              • fancyfredbot 15 hours ago

                                                                                                                                  Strangely, the paper doesn't seem to give much detail on the CS:GO example. Actually, the paper explicitly mentions it's limited to discrete control environments. Unless I'm missing something, the mouse input for Counter-Strike isn't discrete and wouldn't work.

                                                                                                                                  I'm not sure why the title says it was trained on 2x 4090s either, as I can't see that on either the linked page or in the paper. The paper mentions a GPU-year of 4090 compute was used to train the Atari model.

                                                                                                                                • c1b 15 hours ago

                                                                                                                      The CS:GO model is only 1.5 GB, and training took 12 days on a 4090:

                                                                                                                                  https://github.com/eloialonso/diamond/tree/csgo?tab=readme-o...

                                                                                                                                  • fancyfredbot 15 hours ago

                                                                                                                        Thanks, that's the detail I was looking for on the training. It's amazing that results like this can be achieved at such a low cost! I thought this kind of work was out of reach for the GPU-poor.

                                                                                                                        The part about the continuous control still seems weird to me, though. If anyone understands it, I'd be very interested to hear more.

                                                                                                                                • LarsDu88 10 hours ago

                                                                                                                                  Iterative denoising diffusion is such a hurdle for getting this sort of thing running at reasonable fps

                                                                                                                                  • shahzaibmushtaq 13 hours ago

                                                                                                                        Having played CS 1.6 and CS:GO in my free time before the pandemic, I can tell this playable CS diffusion world was trained on footage from a noob player for research purposes.

                                                                                                                        After reading the comments, I assume that if you play outside the scope it was trained on, the game loses its functionality.

                                                                                                                                    Nevertheless, R&D for a good cause is something we all admire.

                                                                                                                                    • crossroadsguy 13 hours ago

                                                                                                                          How is the latest version, CS2 (I think)? It's been free to play like GO, I guess. Is it like GO, where the physics felt too dramatised (could just be my opinion)? Or realistic in a snappy way, like 1.6?

                                                                                                                                      • shahzaibmushtaq 11 hours ago

                                                                                                                            Honestly, I first heard about CS2 from you. And you're right about what you said about GO.

                                                                                                                                    • ThouYS 15 hours ago

                                                                                                                                      I don't really understand the intuition on why this helps RL. The original game has a lot more detail, why can't it be used directly?

                                                                                                                                      • jampekka 15 hours ago

                                                                                                                                        It is used as a predictive model of the environment for model-based RL. I.e. agents can predict consequences of their actions.
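
                                                                                                                            Roughly, the agent rolls its policy out inside the learned model instead of the real environment. A hedged sketch, with all callables as hypothetical stand-ins:

                                                                                                                                def imagine(world_model, reward_model, policy, start_frame, horizon=15):
                                                                                                                                    # Roll the policy out "in imagination": no real environment
                                                                                                                                    # steps are consumed, and the agent sees the predicted
                                                                                                                                    # consequences of its actions.
                                                                                                                                    frame, total_reward = start_frame, 0.0
                                                                                                                                    for _ in range(horizon):
                                                                                                                                        action = policy(frame)               # agent picks an action
                                                                                                                                        frame = world_model(frame, action)   # model predicts what happens next
                                                                                                                                        total_reward += reward_model(frame)  # and the associated reward
                                                                                                                                    return total_reward                      # usable as a training signal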

                                                                                                                                        • ThouYS 15 hours ago

                                                                                                                                          Oh, I see. I was somehow under the impression that the simulation was the game the RL agent learns to play (which kinda seemed nonsensical).

                                                                                                                                        • visarga 14 hours ago

                                                                                                                              It can use the game directly, but if you try this with real-life robots, it is better to run a neural simulation before performing an action that could result in injury or damage. We don't need to drive our cars off the road many times to learn to stay on the road, because we can imagine the consequences. Same thing here.

                                                                                                                                          • FeepingCreature 15 hours ago

                                                                                                                                            In the real world, you can't just boot up a copy of reality to play out strategies. You need an internal model.

                                                                                                                                            • tourmalinetaco 15 hours ago

                                                                                                                                  So, effectively, these video game models are proofs of concept to say "we can make models with extremely accurate predictions using minimal resources"?

                                                                                                                                              • usrusr 13 hours ago

                                                                                                                                    Not sure where you see the "minimal resources" here? But I'd counter all questions about "why" with the blanket response of "for understanding natural intelligence". The way biology innovates is that it throws everything against the wall, and it doesn't pick the one thing that sticks as the winner and focus on that mechanism: it keeps the sticky bits, and also everything else, as long as the cost isn't prohibitive. Symbolic modeling ("this is an object that can fall down"), prediction chains based on visual similarity patterns (this), hardwired reflexes (we tend not to trust anything that looks and moves like a spider or snake), and who knows what else: it's all there, it all runs in parallel, invited or not, and it all influences the rest in subtle and less subtle ways. The interaction is not engineered; it's more like crosstalk that's allowed to happen and has more upside than downside, or else evolution would have preferred variations of the setup with less of that kind of crosstalk. But in our quest to understand ourselves, it's super exciting to see candidates for processes that perhaps play some role in our minds, in isolation, no matter whether that role is big or small.

                                                                                                                                                • vbezhenar 13 hours ago

                                                                                                                                      Maybe I'm wrong, but my understanding is that you can film some area using, say, dashcams and then generate this kind of neural model. Then you can train a robot to walk in this area with the neural model; it can perform billions of training sessions without touching the physical world. Alternatively, you could do a 3D scan of the area, recreate its 3D model, and use, say, a game engine to simulate it, but that probably requires more effort and isn't necessarily better.

                                                                                                                                                  • usrusr 13 hours ago

                                                                                                                                                    And the leg motions we sometimes see in sleeping dogs suggest that this is very much a way how having dreams is useful!

                                                                                                                                                  • empath75 4 hours ago

                                                                                                                                                    Yes, this is exactly right. What they need is a giant dataset of agent data and audio and video from real world locomotion.

                                                                                                                                              • Zealotux 13 hours ago

                                                                                                                                    Could we imagine parts of games becoming "targets" for models? For example, hair and fur physics have been notoriously difficult to nail, but it should be easier to use AI to simulate some fake physics on top of the rendered frame, right? Is anyone working on that?

                                                                                                                                                • thenthenthen 15 hours ago

                                                                                                                                      When my game starts to look like this, I know it is time to quit, haha. Maybe a helpful tool in gaming-addiction therapy? The morphing of the gun/skins and the environment (the sandbags), wow. I would like to play this and see what happens when you walk backwards, turn around quickly, or use 'noclip' :D

                                                                                                                                                  • advael 15 hours ago

                                                                                                                                                    Dang this is the first paper I've seen in a while that makes me think I need new GPUs

                                                                                                                                                    • w-m 15 hours ago

                                                                                                                                                      If you're not bored with it yet, here's a Deep Dive (NotebookLM, generated podcast). I fed it the project page, the arXiv paper, the GitHub page, and the two twitter threads by the authors.

                                                                                                                                                      https://notebooklm.google.com/notebook/a240cb12-8ca1-41b4-ab... (7m59s)

                                                                                                                                                      As always, it's not actually much of a technical deep dive, but gives a quite decent overview of the pieces involved, and its applications.

                                                                                                                                                      • thierrydamiba 14 hours ago

                                                                                                                                                        How did you get the output to be so long? My podcasts are 3 mins max…

                                                                                                                                                        • w-m 8 hours ago

                                                                                                                                              Oh wow, really? Even if you feed it whole research papers? The ones I've tried until now were more in the 8-10 minute range. I haven't looked into how to control the output yet. Hopefully that'll get a little more transparent and controllable soon.

                                                                                                                                                      • delusional 15 hours ago

                                                                                                                                            I just checked it out real quick. It works perfectly well on an AMD card with ROCm PyTorch.

                                                                                                                                                        It seems decent in short bursts. As it goes on it quite quickly loses detail and the weapon has a tendency to devolve into colorful garbage. I would also like to point out that none of the videos show what happens when you walk into a wall. It doesn't handle it very gracefully.

                                                                                                                                                        • gadders 13 hours ago

                                                                                                                                                          Cool achievement, but I want AI to give me smarter NPCs, not simulate the map.

                                                                                                                                                          • thelastparadise 12 hours ago

                                                                                                                                                            The NPCs need a model of the world in their brain in order to act normal.

                                                                                                                                                          • styfle 12 hours ago

                                                                                                                                                            But does it work on macOS?

                                                                                                                                                            (The latest CS removed support for macOS)

                                                                                                                                                            • akomtu 6 hours ago

                                                                                                                                                  The current batch of ML models looks a lot like filling in holes in a wall of text, drawings or movies: you erase a part of the wall and tell it to fix it. It fills in the hole using colors from the nearby walls in the kitchen and from similar walls elsewhere, and we watch this in awe, thinking it must have figured out the design rules of the kitchen. However, what it has really done is interpolate the gaps with some sort of basis functions, trigonometric polynomials for example, and it used thousands of those. This solution wouldn't occur to us because our limited memory isn't enough for thousands of polynomials: we have to find a compact set of rules or give up entirely. So when these ML models predict the motion of planets, they approximate Newton's law with a long series of basis functions.
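
                                                                                                                                                  That picture is easy to reproduce numerically: a least-squares fit over a few hundred trigonometric basis functions interpolates a "law" almost perfectly on the training interval while containing no compact rule at all. A sketch (my own toy example, not anyone's actual model):

                                                                                                                                                      import numpy as np

                                                                                                                                                      # "Observations": positions of a falling object;
                                                                                                                                                      # the compact rule is 0.5 * g * t**2.
                                                                                                                                                      t = np.linspace(0, 2, 200)
                                                                                                                                                      y = 0.5 * 9.81 * t**2

                                                                                                                                                      # Interpolate with hundreds of trig basis functions instead.
                                                                                                                                                      K = 300
                                                                                                                                                      X = np.column_stack([np.sin(k * t) for k in range(1, K + 1)])
                                                                                                                                                      coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

                                                                                                                                                      print(np.max(np.abs(X @ coeffs - y)))  # ~0 on the training interval...
                                                                                                                                                      # ...yet the 300 coefficients encode no law, and
                                                                                                                                                      # extrapolation past t=2 falls apart.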

                                                                                                                                                              • iwontberude 13 hours ago

                                                                                                                                                                This is crazy looking, I know it’s basically useless but it’s cool anyways.

                                                                                                                                                                • mixtureoftakes 14 hours ago

                                                                                                                                                    This is crazy.

                                                                                                                                                    When trying to run it on a Mac, it only plays in a very small window; how can this be configured?

                                                                                                                                                                  • 6510 15 hours ago

                                                                                                                                                                    Can it use a seed that makes the same map every time?

                                                                                                                                                                    • madaxe_again 15 hours ago

                                                                                                                                                                      I earnestly think this is where all gaming will go in the next five years - it’s going to be so compelling that stuff already under development will likely see a shift to using diffusion models. As this is demonstrating, a sufficiently honed model can produce realtime graphics - and some of the demos floating around where people are running GTA San Andreas through non-realtime models hint as to where this will go.

                                                                                                                                                                      I give it the same five years before there are games entirely indistinguishable from reality, and I don’t just mean graphical fidelity - there’s no reason that the same or another model couldn’t provide limitless physics - bust a hole through that wall, set fire to this refrigerator, whatever.

                                                                                                                                                                      • qayxc 15 hours ago

                                                                                                                                                                        I think you're missing the most important point: these models need to be trained on something and that something is a fully developed, working game.

                                                                                                                                                                        You're basically saying that game development would need to do the work twice: step 1: develop a fully functional game, step 2: spend ridiculous effort (in terms of time and compute) on training a model to emulate the game in a half-baked fashion.

                                                                                                                                                                        It's a solution looking for a problem.

                                                                                                                                                                        • manmal 15 hours ago

                                                                                                                                                                          The world model can still be rendered in very low res, and then the diffusion skin/remaster is applied.

                                                                                                                                                            And this would also be an exciting route for remastering old games. I'd pay a lot to play NFS Porsche again, with photorealism. Or imagine Command & Conquer: Red Alert "rendered" with such a model.

                                                                                                                                                                          • qayxc 15 hours ago

                                                                                                                                                                            NVIDIA's RTX Remix [1] suite of tools already does that. It doesn't require any model training or dozens of hours of pre-recorded gameplay either.

                                                                                                                                                              You can drop in low-res textures and have AI tools upscale them. Models can be replaced, as well as lighting, and the best part: it's all under your control. You're not at the mercy of obscure training material that might or might not result in a consistent look and feel. More knobs, more control, less compute required.

                                                                                                                                                                            [1] https://www.nvidia.com/en-us/geforce/rtx-remix/

                                                                                                                                                                            • manmal 14 hours ago

                                                                                                                                                                              TIL, thanks for posting. The workflow I was sketching out is simpler though: Render a legacy game or low fidelity modern game as-is, and run it through a diffusion model in real time.

                                                                                                                                                                          • empath75 4 hours ago

                                                                                                                                                                            No, you use it to simulate things that we don't have efficient perfect models of -- like the actual world. Everyone is correct that using this to simulate counterstrike is pointless. This is not video game technology, this is autonomous agent technology -- training robots to predict and navigate the real world.

                                                                                                                                                                            • FeepingCreature 15 hours ago

                                                                                                                                                                              You can crosstrain on reality.

                                                                                                                                                                          • casenmgreen 15 hours ago

                                                                                                                                                                            Not a chance.

                                                                                                                                                                            There are fundamental limitations with what are in the end all essentially neural nets; there is no understanding, only prediction. Prediction alone is not enough to emulate reality, which is why for example genuinely self-driving cars have not, and will not, emerge. A fundamental advance in AI technology will be required for that, something which leads to genuine intelligence, and we are no closer to that than ever we were.

                                                                                                                                                                            • fancyfredbot 15 hours ago

                                                                                                                                                                Looking at the examples of Atari 2600 games in the paper, I'm not sure you can tell that they are just predictions.

                                                                                                                                                                              Have you considered how you'd tell the difference between a prediction and understanding in practice?

                                                                                                                                                                              • francoisfleuret 15 hours ago

                                                                                                                                                                                "there is no understanding, only prediction"

                                                                                                                                                                                I have no idea what this means.

                                                                                                                                                                                • nonrandomstring 14 hours ago

                                                                                                                                                                                  > > "there is no understanding, only prediction"

                                                                                                                                                                                  > I have no idea what this means.

                                                                                                                                                                                  You can throw a ball up in the air and predict that it will fall again and bounce. You have no understanding of mass, gravity, acceleration, momentum, impulse, elasticity...

                                                                                                                                                                                  You can press a button that makes an Uber car appear in reality and take you home. You have no understanding of apps, operating systems, radio, internet, roads, wheels, internal combustion engines, driving, GPS, maps...

                                                                                                                                                                                  This confusion of understanding and prediction affects a lot of people who use technology in a "machine-like" way, purely instrumental and utilitarian... "how does this get me what I want immediately?"

                                                                                                                                                                    You can take any complex reality and deflate it, abstract it, reduce it down to a mere set of predictions that preserve all the utility for a narrow task (in this case, visual facsimile) but strip away all depth of meaning. The models, both of the system and of the user's internal working model, are flattened. In this sense "AI" is probably the greatest assault on actual knowledge since the book burnings under the totalitarian regimes of the mid-20th century.

                                                                                                                                                                                  • binary132 14 hours ago

                                                                                                                                                                                    I think GP is saying that understanding is measured by predictive capability of the theory

                                                                                                                                                                                    and in case you hadn’t noticed, that kind of uncomprehending slopthink has been going on for a lot longer than the AI fad

                                                                                                                                                                                    • GaggiX 14 hours ago

                                                                                                                                                                                      What if the model actually understands that the ball will fall and bounce because of mass, gravity, acceleration, momentum, impulse, elasticity? I mean you can just ask ChatGPT and Claude, I guess you would answer that in this case it's just prediction, but if they were human then it would be understanding.

                                                                                                                                                                                      • nonrandomstring 14 hours ago

                                                                                                                                                                                        > I guess you would answer that in this case it's just prediction,

                                                                                                                                                                                        No I would answer that it is indeed understanding, to upend your "guess" (prediction) and so prove that while you think you can "predict" the next answer you lack understanding of what the argument is really about :)

                                                                                                                                                                                        • GaggiX 13 hours ago

                                                                                                                                                                            I think I understand the topic quite well, whereas you deliberately avoided answering the question. You gave a practical example that doesn't really work in practice.

                                                                                                                                                                                    • tourmalinetaco 15 hours ago

                                                                                                                                                                        The ML model has no idea what it's making, where you are in the map, what you left behind, or what you picked up. It can accurately predict what comes next, but if you pick up an item and do a 360° turn, the item will be back, and you can repeat the process.

                                                                                                                                                                                      • GaggiX 15 hours ago

                                                                                                                                                                          When a human does it, it's understanding; when an AI does it, it's prediction. I think it's very clear /s

                                                                                                                                                                                        • therouwboat 14 hours ago

                                                                                                                                                                            Does what? In a normal game world, things tend to stay where they are without the player having to do anything.

                                                                                                                                                                                          • GaggiX 13 hours ago

                                                                                                                                                                              We are talking about neural networks in general, not this one or that one; if you train a bad model, or the model is untrained, it indeed would not understand much of anything.

                                                                                                                                                                                      • killerstorm 15 hours ago

                                                                                                                                                                                        That's bs. You have no understanding of understanding.

                                                                                                                                                                                        Hooke's law was pure curve-fitting. Hooke definitely did not understand the "why". And yet we don't consider that bad physics.

                                                                                                                                                                                        Newton's laws can be derived from curve fitting. How is that different from "understanding"?

                                                                                                                                                                                        • madaxe_again 14 hours ago

                                                                                                                                                                                          Einstein couldn’t even explain why general relativity occurred. Sure, spacetime is curved by mass, but why? What a loser.

                                                                                                                                                                                          • killerstorm 12 hours ago

                                                                                                                                                                                            It's very illustrative to look into the history of discovery of laws of motion, as it's quite well documented.

                                                                                                                                                                                            People have an intuitive understanding of motion - we see it literally every day, we throw objects, etc.

                                                                                                                                                                              And yet it took literally thousands of years after the discovery of mathematics (geometry, etc.) to formulate the concepts of force, momentum, etc.

                                                                                                                                                                                            Ancient Greek mathematicians could do integration, so they were not lacking mathematical sophistication. And yet their understanding of motion was so primitive:

                                                                                                                                                                                            Aristotle, an extremely smart man, was muttering something about "violent" and "natural" motion: https://en.wikipedia.org/wiki/Newton%27s_laws_of_motion#Anti...

                                                                                                                                                                              People started to understand the conservation of the quantity of motion only in the 17th century.

                                                                                                                                                                                            So we have two possibilities:

                                                                                                                                                                              * everyone until the 17th century was dumb af (despite being able to do quite impressive calculations)

                                                                                                                                                                                            * scientific discovery is really a heuristic-driven search process where people try various things until they find a good fit

                                                                                                                                                                              I.e. millions of people were somehow failing to understand motion for literally thousands of years, until they collected enough assertions about motion to formulate the conservation rule, test it, and confirm it fits. Only then did it become understanding.

                                                                                                                                                                                            You can literally see conservation of momentum on a billiard table: you "violently" hit one ball, it hits other balls and they start to move, but slower, etc. So you really transfer something from one ball to the rest. And yet people could not see it for thousands of years.

                                                                                                                                                                                            What this shows is that there's nothing fundamental about understanding: it's just a sense of familiarity, it is a sense that your model fits well. Under the hood it's all prediction and curve fitting.

                                                                                                                                                                              We literally have prediction hardware in our brains: the cerebellum has specialized cells which can predict, e.g., motion. So people with a damaged cerebellum have impaired movement: they can still move, but their movements are not precise. When do you think we'll find specialized "understanding" cells in the human brain?

                                                                                                                                                                                            • mrob 12 hours ago

                                                                                                                                                                                              It seems to me that your evidence supports the exact opposite of your conclusion. Familiarity was only enough to find ad-hoc heuristics for specific situations. It let us discover intuitive methods to throw stones, drive carts, play ball games, etc. but never discovered the general principle behind them. A skilled archer does not automatically know that the same rules can be used to aim a mortar.

                                                                                                                                                                                              Ad-hoc heuristics are not the same thing as understanding. It took formal reasoning for humans to actually understand motion, of a type that modern AI does not use. There is something fundamental about understanding that no amount of familiarity can substitute for. Modern AI can gain enormous amounts of familiarity but still fail to understand, e.g. this Counter-Strike simulator not knowing what happens when the player walks into a wall.

                                                                                                                                                                                              • killerstorm 10 hours ago

                                                                                                                                                                                                People found that `m * v` is the quantity which is conserved.

                                                                                                                                                                                                There's no understanding. It's just a formula which matches the observations. It also matches our intuition (a heavier object is hard to move, etc), and you feel this connection as understanding.

                                                                                                                                                                                                Centuries later people found that conservation laws are linked to symmetries. But again, it's not some fundamental truth, it's just a link between two concepts.

                                                                                                                                                                                  An LLM can link two concepts too. So why do you believe that an LLM cannot understand?

                                                                                                                                                                                  In middle school I did extremely well in physics classes - I could solve complex problems my classmates couldn't, because I could visualize the physical process (e.g. the motion of an object) and link that to formulas. This means I understood it, right?

                                                                                                                                                                                  Years later I thought, "But what *is* motion, fundamentally?" I grabbed the Landau-Lifshitz mechanics textbook. How do they define motion? Apparently, bodies move in a way that minimizes some integral. They can derive the rest from that. But it doesn't explain what motion is. Some of the best physicists in the world cannot define it.

                                                                                                                                                                                  So I don't think there's anything to understanding except a feeling of connection between different things: "X is like Y, except for Z".

                                                                                                                                                                                                • mrob 10 hours ago

                                                                                                                                                                                                  Understanding is finding the simplest general solution. Newton's laws are understanding. Catching the ball is not. LLMs take billions of parameters to do anything and don't even generalize well. That's obviously not understanding.

                                                                                                                                                                                                  • killerstorm 9 hours ago

                                                                                                                                                                                      You're confusing two meanings of the word "understanding":

                                                                                                                                                                                                    1. Finding a comprehensive explanation

                                                                                                                                                                                                    2. Having a comprehensive explanation which is usable

                                                                                                                                                                                      99.999% of people on Earth never discover any new laws, so I don't think you can use #1 as a fundamental deficiency of LLMs.

                                                                                                                                                                                      And nobody is saying that just training an LLM produces understanding of new phenomena. That's a strawman.

                                                                                                                                                                                      The thesis is that a more powerful LLM, together with more software, more models, etc., can potentially discover something new. That hasn't been observed yet. But I'd say it would be weird if an LLM could match the capabilities of average folk but never match Newton. It's not as if Newton's brain were fundamentally different.

                                                                                                                                                                                                    Also worth noting that formulas can be discovered by enumeration. E.g. `m * v` should not be particularly hard to discover. And the fact that it took people centuries implies that that's what happened: people tried different formulas until they found one which works. It doesn't have to be some fancy Newton magic.
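
                                                                                                                                                                                      That kind of enumeration really is mechanizable. A toy sketch (an illustration, not anything from the paper): simulate elastic collisions and test which monomials m^a * v^b are conserved; momentum (m*v) and kinetic energy (m*v^2, up to the factor of 1/2) fall out immediately:

                                                                                                                                                                                          import numpy as np

                                                                                                                                                                                          def elastic_collision(m1, v1, m2, v2):
                                                                                                                                                                                              # Standard 1-D elastic collision formulas.
                                                                                                                                                                                              u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
                                                                                                                                                                                              u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
                                                                                                                                                                                              return u1, u2

                                                                                                                                                                                          rng = np.random.default_rng(0)
                                                                                                                                                                                          for a in (0, 1, 2):          # enumerate exponents of m
                                                                                                                                                                                              for b in (1, 2):         # and of v
                                                                                                                                                                                                  ok = True
                                                                                                                                                                                                  for _ in range(100):  # random test collisions
                                                                                                                                                                                                      m1, m2 = rng.uniform(1, 5, 2)
                                                                                                                                                                                                      v1, v2 = rng.uniform(-3, 3, 2)
                                                                                                                                                                                                      u1, u2 = elastic_collision(m1, v1, m2, v2)
                                                                                                                                                                                                      before = m1**a * v1**b + m2**a * v2**b
                                                                                                                                                                                                      after = m1**a * u1**b + m2**a * u2**b
                                                                                                                                                                                                      if not np.isclose(before, after):
                                                                                                                                                                                                          ok = False
                                                                                                                                                                                                          break
                                                                                                                                                                                                  if ok:
                                                                                                                                                                                                      # prints m^1 * v^1 (momentum) and m^1 * v^2 (2x kinetic energy)
                                                                                                                                                                                                      print(f"m^{a} * v^{b} is conserved")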

                                                                                                                                                                                                    • mrob 9 hours ago

                                                                                                                                                                                                      I'm certain that people did not spend centuries trying different formulas for the laws of motion before finding one that worked. The crucial insight was applying any formula at all. Once you have that then the rest is relatively easy. I don't see LLMs making that kind of discovery.

                                                                                                                                                                                        • madaxe_again 13 hours ago

                                                                                                                                                                                          Yet we have no understanding, only prediction. We can describe a great many things in detail, how they interact - and we can claim to understand things, yet if you recursively ask “why?” everybody, and I mean everybody, will reach a point where they say “I don’t know” or “god”.

                                                                                                                                                                                    An incomplete understanding is no understanding at all. I would argue that we can only predict, and that we certainly can emulate reality; otherwise we would not be able to function within it. A toddler can emulate reality and anticipate causality, and they certainly can't be said to be in possession of a robust grand unified theory.

                                                                                                                                                                                          • jiggawatts 15 hours ago

                                                                                                                                                                                            For simulations like games, it's a trivial matter to feed the neural game engine pixel-perfect metadata.

                                                                                                                                                                                            Instead of rendering the final shaded and textured pixels, the engine would output just the material IDs, motion vectors, and similar "meta" data that would normally be the inputs into a real-time shader.

                                                                                                                                                                                            The AI can use this as inputs to render a photorealistic output. It can be trained using offline-rendered "ground-truth" raytraced scenes. Potentially, video labelled in a similar way could be used to give it a flair of realism.

                                                                                                                                                                                            This is already what NVIDIA DLSS and similar AI upscaling tech uses. The obvious next step is not just to upscale rendered scenes, but to do the rendering itself.
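
                                                                                                                                                                                      A minimal sketch of that idea, with made-up channel counts (DLSS's internals aren't public): a small conv net consumes G-buffer channels (material IDs, motion vectors, normals) and predicts the shaded RGB frame.

                                                                                                                                                                                          import torch
                                                                                                                                                                                          import torch.nn as nn

                                                                                                                                                                                          class GBufferRenderer(nn.Module):
                                                                                                                                                                                              # Toy "neural renderer": G-buffer in, shaded RGB out. A sketch, not DLSS.
                                                                                                                                                                                              def __init__(self, n_materials=64):
                                                                                                                                                                                                  super().__init__()
                                                                                                                                                                                                  self.material_emb = nn.Embedding(n_materials, 8)  # material ID -> features
                                                                                                                                                                                                  in_ch = 8 + 2 + 3  # material features + 2-D motion vectors + normals
                                                                                                                                                                                                  self.net = nn.Sequential(
                                                                                                                                                                                                      nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                                                                                                                                                                                                      nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),  # RGB in [0, 1]
                                                                                                                                                                                                  )

                                                                                                                                                                                              def forward(self, material_ids, motion, normals):
                                                                                                                                                                                                  mat = self.material_emb(material_ids).permute(0, 3, 1, 2)  # (B,8,H,W)
                                                                                                                                                                                                  return self.net(torch.cat([mat, motion, normals], dim=1))

                                                                                                                                                                                          # Trained against offline ray-traced "ground truth" frames, as described above.
                                                                                                                                                                                          B, H, W = 1, 64, 64
                                                                                                                                                                                          rgb = GBufferRenderer()(torch.randint(0, 64, (B, H, W)),
                                                                                                                                                                                                                  torch.randn(B, 2, H, W), torch.randn(B, 3, H, W))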

                                                                                                                                                                                          • viraptor 15 hours ago

                                                                                                                                                                                            It's not that great yet.

                                                                                                                                                                                    Given a model which can generate the game view in ~real time and a model which can generate the 3D models and textures, why would you ever use the first option, apart from as a cool tech demo? I'm sure there's space for new dreamy games where the invisible space behind you transforms when you turn around, but for other genres... why? Destructible environments have been possible for quite a while, but once you allow that everywhere, you can get games into an unplayable state. They need to be designed around that mechanic to work well: Noita, Worms, Teardown, etc. I don't believe the "limitless physics" would matter after a few minutes.

                                                                                                                                                                                            • Arch485 15 hours ago

                                                                                                                                                                                              It seems extremely unlikely to me that ML models will ever run entire games. Nobody wants a game that's "entirely indistinguishable from reality" anyways. If they did, they would go outside.

                                                                                                                                                                                              I think it's possible specific engine components could be ML-driven in the future, like graphics or NPC interactions. This is already happening to a certain degree.

                                                                                                                                                                                              Now, I don't think it's impossible for an ML model to run an entire game. I just don't think making + running your game in a predictive ML model will ever be more effective than making a game the normal way.

                                                                                                                                                                                              • jsheard 15 hours ago

                                                                                                                                                                                                Yep, the fuzziness and opaqueness of ML models makes developing an entire game state inside one a non-starter in my opinion. You need precise rules, and you need to be able to iterate on those rules quickly, neither of which are feasible with our current understanding of ML models. Nobody wants a version of CS:GO where fundamental constants like weapon damage run on dream logic.

                                                                                                                                                                                                If ML has any place in games it's for specific subsystems which don't need absolute precision, NPC behaviour, character animation, refining the output of a renderer, that kind of thing.

                                                                                                                                                                                              • advael 15 hours ago

                                                                                                                                                                                        I'm not sure that's a warranted assumption based on this result, exciting as it is; we are still seeing replication of an extant, testable world model, rather than extrapolation that can produce novel mechanics that aren't in the training data. I'm not saying this isn't a stepping stone to that; I just think your prediction's a little optimistic given the scope of that problem.

                                                                                                                                                                                                • TinkersW 15 hours ago

                                                                                                                                                                                          It requires a monster GPU to run at 10 fps at what looks like sub-720p... I think it may be a bit more than 5 years.

                                                                                                                                                                                                • snickerer 15 hours ago

                                                                                                                                                                                                  I see where this is going.

                                                                                                                                                                                          The next step for creating training data is a real human with a bodycam. You would only need to connect the real body movements (stepping forward, turning left, etc.) to typical keyboard-and-mouse game control events and feed those into the model too.
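
                                                                                                                                                                                          A sketch of that pairing, with entirely hypothetical names and thresholds: reduce tracked body motion to the same discrete key/mouse events a game-trained model already expects.

                                                                                                                                                                                              def motion_to_action(forward_velocity, yaw_rate):
                                                                                                                                                                                                  # Map tracked body motion to game-style controls so bodycam
                                                                                                                                                                                                  # footage can be paired with "input events" for training.
                                                                                                                                                                                                  # Thresholds are made-up constants.
                                                                                                                                                                                                  keys = set()
                                                                                                                                                                                                  if forward_velocity > 0.3:      # m/s
                                                                                                                                                                                                      keys.add("W")
                                                                                                                                                                                                  elif forward_velocity < -0.3:
                                                                                                                                                                                                      keys.add("S")
                                                                                                                                                                                                  mouse_dx = yaw_rate * 100.0     # deg/s -> fake mouse counts
                                                                                                                                                                                                  return keys, mouse_dx

                                                                                                                                                                                              print(motion_to_action(1.2, -15.0))  # ({'W'}, -1500.0)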

                                                                                                                                                                                                  I think that is what the devs here are dreaming about.
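
                                                                                                                                                                                                  A minimal sketch of that mapping, with made-up thresholds and key names; a real pipeline would log these actions alongside each bodycam frame:

                                                                                                                                                                                                    # Hypothetical mapping from tracked body motion to keyboard/mouse-style
                                                                                                                                                                                                    # actions, so each bodycam frame can be paired with a game-like control.
                                                                                                                                                                                                    # Thresholds and action names are made up.
                                                                                                                                                                                                    def motion_to_actions(forward_velocity, yaw_rate):
                                                                                                                                                                                                        actions = []
                                                                                                                                                                                                        if forward_velocity > 0.2:        # m/s threshold (assumed)
                                                                                                                                                                                                            actions.append("KEY_W")       # stepping forward -> W
                                                                                                                                                                                                        if yaw_rate < -0.3:               # rad/s threshold (assumed)
                                                                                                                                                                                                            actions.append("MOUSE_LEFT")  # turning left -> mouse move
                                                                                                                                                                                                        elif yaw_rate > 0.3:
                                                                                                                                                                                                            actions.append("MOUSE_RIGHT")
                                                                                                                                                                                                        return actions

                                                                                                                                                                                                    print(motion_to_actions(0.5, -0.4))   # ['KEY_W', 'MOUSE_LEFT']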

                                                                                                                                                                                                  • CaptainFever 14 hours ago

                                                                                                                                                                                                    Or a cockpit cam for the world's most realistic flight simulator. /lighthearted

                                                                                                                                                                                                    • devttyeu 15 hours ago

                                                                                                                                                                                                      The "We live in a simulation" argument just started looking a lot more conceivable.

                                                                                                                                                                                                      • tiborsaas 12 hours ago

                                                                                                                                                                                                        I'm already very suspicious: we just got the same room number at the third hotel in a row. Someone got lazy with the details :)

                                                                                                                                                                                                        • iwontberude 13 hours ago

                                                                                                                                                                                                          Not really, because then is it simulators all the way down? Simulation theory explains nothing and only adds more unexplainable phenomena.

                                                                                                                                                                                                        • devttyeu 15 hours ago

                                                                                                                                                                                                          Could probably make a decent dataset from VR headset tracking cameras + motion sensors + passthrough output + decoded hand movements
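
                                                                                                                                                                                                          One plausible record layout for such a dataset, sketched below; every field name and shape is an assumption rather than any real headset's API:

                                                                                                                                                                                                            from dataclasses import dataclass
                                                                                                                                                                                                            import numpy as np

                                                                                                                                                                                                            # Hypothetical record for a VR-derived world-model dataset.
                                                                                                                                                                                                            @dataclass
                                                                                                                                                                                                            class VRSample:
                                                                                                                                                                                                                passthrough: np.ndarray   # (H, W, 3) passthrough camera frame
                                                                                                                                                                                                                head_pose: np.ndarray     # (4, 4) pose from tracking cameras
                                                                                                                                                                                                                imu: np.ndarray           # (6,) accelerometer + gyroscope reading
                                                                                                                                                                                                                hands: np.ndarray         # (2, 21, 3) decoded hand keypoints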

                                                                                                                                                                                                        • TealMyEal 15 hours ago

                                                                                                                                                                                                          What's the end goal here? Personalised games for everyone? Ultra-graphics? I don't really see how this is going to be better than our engine-based systems.

                                                                                                                                                                                                          I love being a horse in the 1900s insisting the automobile will never take off /s

                                                                                                                                                                                                          • visarga 14 hours ago

                                                                                                                                                                                                            The goal is to train agents that can imagine consequences before acting. But it could also become a cheap way to create experiences and user interfaces on the fly: imagine any app UI dreamed up like that, not just games. Generative visual interfaces could be a big leap over text mode.
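
                                                                                                                                                                                                            "Imagining consequences before acting" usually means rollouts inside the learned model: simulate each candidate action a few steps ahead, score the imagined futures, and act on the best one. A toy sketch with stubbed dynamics (step and score are placeholders, not a real model):

                                                                                                                                                                                                              # Toy sketch of acting by imagination: roll each candidate action
                                                                                                                                                                                                              # forward inside a (stubbed) world model, pick the best future.
                                                                                                                                                                                                              def step(state, action):
                                                                                                                                                                                                                  return state + action          # dummy imagined dynamics

                                                                                                                                                                                                              def score(state):
                                                                                                                                                                                                                  return -abs(state - 10)        # dummy preference: end near 10

                                                                                                                                                                                                              def plan(state, candidates, horizon=3):
                                                                                                                                                                                                                  def rollout(s, a):
                                                                                                                                                                                                                      for _ in range(horizon):   # imagine a few steps ahead
                                                                                                                                                                                                                          s = step(s, a)
                                                                                                                                                                                                                      return score(s)
                                                                                                                                                                                                                  return max(candidates, key=lambda a: rollout(state, a))

                                                                                                                                                                                                              print(plan(0, [-1, 1, 2]))         # picks 2: closest to 10 in 3 steps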

                                                                                                                                                                                                            • qayxc 15 hours ago

                                                                                                                                                                                                              It's a research paper. Not everything that comes out of research has an immediate real-world application in mind.

                                                                                                                                                                                                              Games are just an accessible and easy to replicate context to work in, they're not an end goal or target application.

                                                                                                                                                                                                              The research is about AI agents interacting with and creating world models. Such world models could just as well be of alien environments - i.e. the kind of modelling an interstellar or even interplanetary probe would need to do on its own, since two-way communication over such distances is impractical.