Comments Page - Diffusion for World Modeling

« Back Diffusion for World Modelingdiamond-wm.github.ioSubmitted by francoisfleuret 9 months ago

smusamashah 9 months ago
This video https://x.com/Sentdex/status/1845146540555243615 looks way too much like my dreams. This is almost exactly that happens when I sometimes try to jump high, it transforms me to a different place just like that. Things keep changing just like that. It's amazing to see how close it is to a real dream experience.
- kleene_op 9 months ago
  I noticed that all text looked garbled up when I had some lucid dreams. When diffusion models started to gain attention, I made the connection that text generated in generated images also looked garbled up.
  Maybe all of those are clues that parts of the human subconscious mind operate pretty close to the principles behind diffusion models.
  smusamashah 9 months ago
  I also lucid dream occasionally. Very rarely things are very detailed, most often the colors and details are just as bleak and blurry and keep changing as these videos. I walk down a street, take a turn (or not), its almost guaranteed I can't go back to where I came from. I usually appreciate when I can track back the same path.
  com2kid 9 months ago
  I've watched entire movies while lucid dreaming. I've listened to entire concerts play out, eaten entire meals with all 5 senses.
  One of the more annoying parts of growing older was that I stopped lucid dreaming.
  qwertox 9 months ago
  I don't think lucid dreaming is a requirement for this. Whenever I dream my environment morphs into another one, scene by scene, things I try to get details from, like the content of a text, refuse to show clearly enough to extract any meaningful information from it, no matter what I try.
  throwaway290 8 months ago
  "I try" sounds like lucid dreams though? In my dreams there is no try, things just happen (including my actions). I think I get meaningful information sometimes, but it's selective
  sci_prog 9 months ago
  Also the AI generated images that can't get the fingers right. Have you ever tried to look at your hands while lucid dreaming and try counting fingers? There are some really interesting parallels between the dreams and diffusion models.
  dartos 9 months ago
  Of course, due to the very nature of dreams, your awareness of diffusion models and their output flavors how you perceive even past dreams.
  Our brains love retroactively altering fuzzy memories.
  hombre_fatal 9 months ago
  On the other hand, psychedelics give you perceptions similar to even early deepdream genai images.
  On LSD, I was swimming in my friend’s pool (for six hours…) amazed at all the patterns on his pool tiles underwater. I couldn’t get enough. Every tile had a different sophisticated pattern.
  The next day I went back to his place (sober) and commented on how cool his pool tiles were. He had nfi what I was talking about.
  I walk out to the pool and sure enough it’s just a grid of small featureless white tiles. Upon closer inspection they have a slight grain to them. I guess my brain was connecting the dots on the grain and creating patterns.
  It was quite a trip to be so wrong about reality.
  Not really related to your claim I guess but I haven’t thought of this story in 10 years and don’t want to delete it.
  Jackson__ 9 months ago
  This may be a joke, but counting your fingers to lucid dream has been a thing for a lot longer than diffusion models.
  That being said, your reality will influence your dreams if you're exposed to some things enough. I used to play minecraft on a really bad PC back in the day, and in my lucid dreams I used to encounter the same slow chunk loading as I saw in the game.
  timschmidt 9 months ago
  Playing Population One in VR did this to me. Whenever I hopped into a new game, I'd ask the other participants if they'd had particularly vivid dreams since getting VR, and more than half of folks said they had.
- siavosh 9 months ago
  What’s amazing is that if you really start paying attention it seems like the mind is often doing the same thing when you’re awake, less noticeable with your visual field but more noticeable with attention and thoughts themselves.
  smusamashah 9 months ago
  This is a very interesting thought. I never thought of mind doing anything like that in wake state. I know I will now be thinking about this idea every time I recall those dreams.
  TaylorAlexander 9 months ago
  Yeah I also hadn’t but it makes sense. Just like all output from an LLM is a “hallucination” but we have a tendency to only call it a hallucination when something looks wrong about the result, we forget that our conscious lived experience is a hallucination based on a bunch of abstract sensory data that our brain fuses in to a world state experience. It’s obvious when we are asleep that dreams are hallucinations, but it is less obvious that the conscious experience is too.
  siavosh 9 months ago
  Agree that the whole senate field is a type of hallucination. But when it comes to thoughts appearing in our minds their transitions are much more classically hallucination like. I became aware of this during meditation. It has a strange intoxicating quality where you have to figuratively pinch yourself to notice how bizarrely one thought transitions to a thought in the periphery of your attention.
- voidUpdate 9 months ago
  Its interesting how much dreams differ from person to person. Mine tend to be completely coherant visually, to the point that I have used google maps in my dreams, and while the geography was inaccurate, it was consistent. However, I have never been lucid within a dream, maybe that makes a difference
- jvanderbot 9 months ago
  This is why I'm excited in a limited way. Clearly something is disconnected in a dream state that has an analogous disconnect here.
  I think these models lack a world model, something with strong spatial reasoning and continuity expectations that animals have.
  Of course that's probably learned too.
- earnesti 9 months ago
  That looks way too much to the one time I did DMT-5
  TechDebtDevin 9 months ago
  Machine Elves
  loxias 9 months ago
  IYKYK
- soheil 9 months ago
  How are you so sure this is like your dreams? If it was easy to accurately remember dreams why would they be all so smooshy and such a jumbled mess like in this video?
- thegabriele 9 months ago
  We are unconsciously (pun intended) implementing how brains work both in dream and wake states. Can't wait until we add some kind of (lossless) memory to this models.
  hackernewds 9 months ago
  Any evidence to back this lofty claim?
  sweeter 9 months ago
  vibes
  soraki_soladead 9 months ago
  We have lossless memory for models today. That's the training data. You could consider this the offline version of a replay buffer which is also typically lossless.
  The online, continuous and lossy version of this problem is more like how our memory works and still largely unsolved.
francoisfleuret 9 months ago
This is 300M parameters model (1/1300th of the big llama-3) trained with 5M frames with 12 days of a GTX4090.
This is what a big tech company was doing in 2015.
The same stuff at industrial scale à la large LLMs would be absolutely mind blowing.
- gjulianm 9 months ago
  What exactly would be the benefit of that? We already have Counter Strike working far more smooth than this, without wasting tons of compute.
  ben_w 9 months ago
  As with diffusion models in general, the point isn't the specific example but that it's generalisable.
  5 million frames of video data with corresponding accelerometer data, and you get this for genuine photorealism.
  gjulianm 9 months ago
  Generalisable how? The model completely hallucinates invalid input, it's not even high quality and required CSGO to work. What's the output you expect from this and what alternatives are there?
  Art9681 9 months ago
  None of those questions are relevant are they? I get the impression you've already decided this isnt good enough, which is basically agreeing with everyone else. No one is talking about what it's capable of today. Read the thread again. We're imagining the great probability a few permutations later this thing will basically be The Matrix.
  pms 8 months ago
  Are we supposed to believe in the usefulness of AI as if it was a matter of belief? The one and only Matrix to come?
  ben_w 9 months ago
  It did not require CSGO, that was simply one of their examples. The very first video in the link shows a bunch of classic Arati games, and even the video which is showing CSGO is captioned "DIAMOND's diffusion world model can also be trained to simulate 3D environments, such as CounterStrike: Global Offensive (CSGO)" — I draw your attention to "such as" being used rather than "only".
  And I thought I was fairly explicit about video data, but just in case that's ambiguous: the stuff you record with your phone camera set to video mode, synchronised with the accelerometer data instead of player keyboard inputs.
  As for output, with the model as it currently stands, I'd expect a 24h training video at 60fps to be "photorealisic and with similar weird hallucinations". Which is still interesting, even without combining this with a control net like Stable Diffusion can do.
  empath75 9 months ago
  You do the same thing at a larger scale, and instead of video game footage you use a few million hours of remote controlled drone input in the real world.
  ramses0 9 months ago
  That's like one INSTANT of youtube / facebook live.
  stale2002 9 months ago
  To answer your question directly, the benefit is that we could make something different from counter strike.
  You see, there are these things called "proof of concept"s that are meant to not be a product, but instead show off capabilities.
  Counterstrike is an example, meant to show off complex capabilities. It is not meant to show how the useful thing of these models is to literally recreate counterstrike.
  gjulianm 9 months ago
  Which capabilities are being shown off here? The ability to take an already existing world-model and take lots of compute to have a worse, less correct model?
  stale2002 9 months ago
  The capability to have mostly working, real time generation of images that represent a world model.
  If that capability is possible, then it could be possible to take 100 examples of seperate world models that exist, and then combine those world models together in interesting ways.
  Combining together world models is an obvious next step (IE, not showed off in this proof of concept. But it is a logical/plausible future capability).
  Having multiple world models combined together in new and interesting ways, is almost like creating an entirely new world model, even though thats not exactly the same.
  FridgeSeal 9 months ago
  Gosh I’m so excited for AAA games to get _even more garbage_ once this gets good enough for investor-driven studios to start slinging out AI driven, hyper-derivative, low-effort slop with even less effort.
  We all thought Sports-Game $CURRENT_YEAR or CoD/MW MicroTransaction-Garbage was bottom of the barrel, a whole new world of fresh hell awaits us!
  stale2002 9 months ago
  Investor driven?
  You have it all wrong. It's not going to be AAA studios doing this. Instead, it will be randos who have a cool/fun concept that they want to try out with a couple friends.
  Sure, most will be bad. But there might be some gems in there that go viral.
  Making it easier for regular people to experiment with games is a good thing.
  Or even better, what if this allows people to throw something together to see if a game mechanic is fun, and once it's tested out, then they can make the game for "real".
  Allowing quicker experimentation times is also a good thing.
  6r17 9 months ago
  You'd have to have the rights to do the training wouldn't you ? Or does that mean I should close off all of my creations legally just so you can't use them ?
  I don't have a problem with you using this with a Camera and the real world or your creations ; but I do have a problem when people are able to use someone's work and use these blend and call them original.
  It's just as I would have taken a 3d model static mesh, apply some blend on it and call it my own.
  No it's f* not.
  stale2002 9 months ago
  > you'd have to have the rights to do the training wouldn't you ?
  Thats undecided by the courts. Everyone is training using other people's data right now though, and very few companies are even being sued (let alone have finished the multi-year process to actually be punished for it).
  > Or does that mean I should close off all of my creations
  When you put something out there, you should expect that other people are going to use those creations. Almost certainly in many ways that you don't approve over.
  Don't release something publicly if you don't want other people to use it.
  > but I do have a problem
  Your anger isn't particularly useful. What are you going to do about it? This particular proof of concept model was made with 1 single GPU running for 2 weeks. You can't stop that.
  > It's just as I would have taken a 3d model static mesh, apply some blend on it and call it my own.
  Something that I am sure many people are doing all the time, already. Transforming and using other people's content is as old as the internet. AI does little to change that.
  6r17 9 months ago
  You misunderstand that I'm also a consumer and that I look upon these things. If I see an someone bring up some AI stuff and call themselves an artist I know how I look at them.
  undefined 9 months ago
  [deleted]
  nuz 9 months ago
  "What would be the point of creating a shooter set in the middle east? We already have pong and donkey kong"
  eproxus 9 months ago
  But please, think of the shareholders!
- GaggiX 9 months ago
  If 12 days with an RTX4090 is all you need, some random people on the Internet will soon start training their own.
  rizky05 9 months ago
  [dead]
- cs702 9 months ago
  Came here to say pretty much the same thing, and saw your comment.
  The rate of progress has been mind-blowing indeed.
  We sure live in interesting times!
- Sardtok 9 months ago
  Two 4090s, but yeah.
  Sardtok 9 months ago
  Never mind, the repo on Github says 12 days on a 4090, so I'm unsure why the title here says two.
marcyb5st 9 months ago
So, this is pretty exciting.
I can how this can already be used to generate realistic physics approximations in a game engine. You create a bunch of snippets of gameplay using a much heavier and realistic physics engine (perhaps even CGI). The model learn to approximate the physics and boom, now you have a lightweight physics engine. Perhaps you can even have several that are specialized (e.g. one for smoke dynamics, one for explosions, ...). Even if it allucinates, wouldn't be worse than the physics bugs that are so common in games.
- monsieurbanana 9 months ago
  > Even if it allucinates, wouldn't be worse than the physics bugs that are so common in games.
  I don't know about that. Physic bugs are common, but you can prioritize and fix the worst (gamebreaking) ones. If you have a blackbox model, it becomes much harder to do that.
- bobsomers 9 months ago
  What makes you think the network inference is less expensive? Newtonian physics is already extremely well known and pretty computationally efficient to compute.
  How would a "function approximation" of Newtonian physics, with billions of parameters, be cheaper to compute?
  It seems like this would both be more expensive and less correct than a proper physics simulation.
  crackalamoo 9 months ago
  Basic Newtonian physics is pretty efficient to compute, but afaik some more complex physics like fluids is faster with network inference. There are probably a lot of cases where network inference physics is faster.
- twic 9 months ago
  Do you think that inference on a thirteen million parameter neural network is more lightweight than running a conventional physics engine?
  procgen 9 months ago
  Convincing liquid physics (e.g. surf interacting with a beach, rocks, the player character) might be a good candidate.
  tiagod 9 months ago
  In some cases, the model will be lighter. There is no need for 14M parameters for physics simulations, and there's a lot of promising work in that area.
  epolanski 9 months ago
  Every software that can be implemented in a JavaScript, ehm, LLM, will eventually be implemented in an LLM.
  kendalf89 9 months ago
  Are you predicting node.llm right now?
- hobs 9 months ago
  A physics bug would be a consistent problem you can fix. There's no such guarantee about an ML model. This would likely only be ok in the context of a game specifically made to be janky.
  fullstackwife 9 months ago
  This is one of the fallacies of current AI research space: they don't focus on the end-user too much. In this case the end-user would be the gamer, and while playing games you expect a valid gameplay, so those kind of hallucinations are not acceptable, while I'm pretty sure they give the AI research authors a strong dopamine trigger. We have a hammer and now we are looking for a nail, while you should ask a question first: what is the problem we are trying to solve here?
  Real world usage will be probably different, and maybe even unexpected by the authors of this research.
  jsheard 9 months ago
  > This is one of the fallacies of current AI research space: they don't focus on the end-user too much. In this case the end-user would be the gamer
  Or from another angle the end-user is a game developer trying to actually work with this kind of technology, which is just a nightmarish prospect. Nobody in the industry is asking for a game engine that runs entirely on vibes and dream logic, gamedev is already chaotic enough when everything is laid out in explicit code and data.
  stale2002 9 months ago
  > they don't focus on the end-user too much.
  Of course they don't. Stuff like this is a proof of concept.
  If they had a product that worked, they wouldn't be in academia. Instead, they would leave the world of research and create a multi billion dollar company.
  Almost by definition, anything in academia isn't going to be productized, because if it was, then the researchers would just stop researching and make a bunch of money selling the product to consumers.
  Such research is still useful for society, though, as it means that someone else can spend the millions and millions of dollars making a better version and then selling that.
  badpun 9 months ago
  The whole purpose of academia is literally to nerd out on cool, impractical things, which will ocasionally turn out to have some real-life relevance years or decades later. This (hallucinated CS) is still more relevant to real world than 99% of what happens in academic research.
  dartos 9 months ago
  Yes to the first part, no to the random “99% useless” number you made up.
  I’m no fan of academia, but it undeniably produces useful and meaningful knowledge regularly.
  badpun 9 months ago
  You made up the "99% useless" straw man, I never used the word useless in my post. I merely said that the AI research is more relevant than 99% of research, which includes things like gender studies, studies of Rennaisance German theatre, discussions of which exact villages were pillaged and in what order during the march of some army 300 years ago etc. etc.
  kqr 9 months ago
  This obsession people have with determinism! I'd much rather take a low rate of weird bugs than common consistent ones. I don't believe reproducibility of bugs makes for better gameplay generally.
  paulryanrogers 9 months ago
  Reproducibility does make bugs more likely to be fixed, or at least fixable.
  Also, games introduce randomness in a controlled way so users don't get frustrated by it appearing in unexpected places. I don't want characters to randomly appear and disappear. It's fine if bullet trajectory varies more randomly as they get further away.
  skydhash 9 months ago
  Also most engines have been worked on for years. So more often than not, core elements like audio, physics, input,... are very stable and the remaining bugs are either "can't fix" or "won't fix".
  NotMichaelBay 9 months ago
  It might be fine for casual players, but it would prevent serious and pro players from getting into the game. In Counter-Strike, for example, pro players (and other serious players) practice specific grenade throws so they can use them reliably in matches.
  kqr 9 months ago
  I'm not saying one can make specifically Counter-Strike on a non-deterministic engine -- that seems like strawmanning my argument.
  People play and enjoy many games with varying levels of randomness as a fundamental component, some even professionally (poker, stock market). This could be made such a game.
  monsieurbanana 9 months ago
  Either the physics engine matter, in which case you want a deterministic engine as you said, or it doesn't like in a poker game and you don't want to spend much resources (manpower, computer cycles) into it.
  Which also means an off-the-shelf deterministic engine.
  mrob 9 months ago
  The whole hobby of speedrunning relies heavily on exploiting deterministic game bugs.
  hobs 9 months ago
  Make a fun game with this as a premise and I will try it, but it sounds just an annoying concept.
  FridgeSeal 9 months ago
  Gosh can you imagine how frustrating a game would be when you can _barely_ do anything reliably?
  Wanted to jump on the platform? Wow too bad, model had a little day-dream and now you’re floating somewhere. Wanted to peak a corner? Oops too bad, model had a moment and now the wall curves differently. Trying to pick a shot-well you’d definitely hit the other player if the model wasn’t making physics go out the window. Inconsistently.
  dartos 9 months ago
  You don’t play a lot of games, huh?
  Consistent bugs you can anticipate and play/work around, random ones you can’t. Just look at pretty much any speed running community for games before 1995.
  Say goodbye to any real competitive scene with random unfixable potentially one off bugs.
  siev 9 months ago
  [dead]
- Thorrez 9 months ago
  Would that work for multiplayer? If it's a visual effect only, I guess it would be ok. But if it affects gameplay, wouldn't different players get different results?
  killerstorm 9 months ago
  Well, it doesn't make sense to use this exact model - this is just demonstration that it can learn world model from pixels.
  An obvious next step towards a more playable game is to add state vector to the inputs of the model: it is easier to learn to render the world from pixels + state vectors than from pixels alone.
  Then it depends what we want to do. If we want normal Counter Strike gameplay but with new graphics, we can keep existing CS game server and train only the rendering part.
  If you want to make Dream-Counter-Strike where rules are more bendable then you might want to train state update model...
- crazygringo 9 months ago
  Yeah, I definitely wouldn't trust it to replace basic physics of running, jumping, bullets, objects shattering, etc.
  But it seems extremely promising for fiery explosions, smoke, and especially water. Anything with dynamics that are essentially complex.
  Also for lighting -- both to get things like skin right with subsurface scattering, as well as global ray-traced lighting.
  You can train specific lightweight models for these things, and they important thing is that their output is correct at the macro level. E.g., a tree should be casting a shadow that looks like the right shadow at the right angle for that type of tree and its types of leaves and general shape. Nobody cares if each individual leaf shadow corresponds to an individual leaf 10 feet above or is just hallucinated.
  BriggyDwiggs42 9 months ago
  I was thinking it would be especially effective for situations where one needed to predict the behavior of a chaotic system, not so much accurately but convincingly. Turbulence flow would be an irreducible system right?
- amelius 9 months ago
  > boom, now you have a lightweight physics engine
  lightweight, but producing several hundred watts of heat.
  marcyb5st 9 months ago
  Not true. If you just need a model to learn the physics and not all the pixels I am sure you can get a much smaller model.
  If you look at the metnet papers similar approaches have been successful for weather prediction where models have better results and are order of magnitude faster at inference time compared to numerical models. Of course, training will be time consuming and resources intensive, but it's something you can do once and then ship it to users when they install / update the game.
- slashdave 9 months ago
  Define "lightweight".
croo 9 months ago
For anyone who actually tried it :
Does it respects/builds some kind of game map in the process or is it just a bizarre psychedelic dream walk experience where you cannot go back the same place twice and space dimensions are just funny? Is a game map finite?
- InsideOutSanta 9 months ago
  Just looking at the first video, there's a section where structures just suddenly appear in front of the player, so this does not appear to build any kind of map, or have any kind of meaningful awareness of something resembling a game state.
  This is similar to LLM-based RPGs I've played, where you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it.
  anal_reactor 9 months ago
  > you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it
  Mondays
- aidos 9 months ago
  Just skimmed the article but my guess is that it’s a dream type experience where if you turned around 180 and walked the other direction it wouldn’t correspond to where you just came from. More like an infinite map.
  lopuhin 9 months ago
  I don't think so, what they show on CS video is exactly the Dust2 map, not just something similar/inspired by it.
  twic 9 months ago
  It's trained on moving around dust2, so as long as the previous frame was a view of dust2, the next frame is very likely to be a plausible subsequent view of dust2. In some sense, this encodes a map; but it's not what most people think of when they think about maps.
  I'd be interested to see what happens if you look down at your feet for a while, then back up. If the ground looks the same everywhere, do you come up in a random place?
  bohadi 9 months ago
  tried this irl and wound up here
  arendtio 9 months ago
  It probably depends on what you see. As long as you have a broad view over a part of the map, you should stay in that region, but I guess that if you look at a mono-color wall, you probably find yourself in a very different part of the map when you look around yourself again.
  But I am just guessing, and I haven't tried it yet.
- delusional 9 months ago
  Just tried it out, and no. It doesn't have any sort of "map" awareness. It's very much in the "recall/replay" category of "AI" where it seems to accurately recall stuff that is part of the training dataset, but as soon as you do something not in there (like walk into a wall), it completely freaks out and spits out gibberish. Plausible gibberish, but gibberish none the less.
  neongreen 9 months ago
  Can you upload a screen recording? I don’t think I can run the model locally but it’d be super interesting to see what happens if you run into a wall
  kqr 9 months ago
  This should mainly be a matter of giving it more training though, right? It sounds like to amount of training it's gotten is relatively sparse.
  treyd 9 months ago
  It doesn't have any ability to reason about what you did more than a couple of seconds ago. Its memory is what's currently on the screen and what the user's last few inputs were.
  delusional 9 months ago
  Theoretically. In practice, that's not clear. As you add more training data you have to ask yourself what the point is. we already have a pretty good simulation of Counter Strike.
mk_stjames 9 months ago
This was Schmidhuber's group is 2018:
https://worldmodels.github.io/
Just want to point that out.
- afh1 9 months ago
  Ahead of its time for sure. Dream is an accurate term here, that driving scene does resemble driving in dreams.
- hervature 9 months ago
  I assume you are pointing this out because it is the first reference in the paper and getting the recognition it deserves and you are simply providing this link for convenience to those who do not go to the references.
  mk_stjames 9 months ago
  Yes, it was very nice to see it was the first citation in the paper (and cited several times throughout).
  The World Models paper is still one of the most amazing papers I've ever read. And I just really keep wanting to show that, in case people really don't see that, many in-the-know... knew.
- Grimblewald 8 months ago
  that was fun to play, gets incredibly unstable if you survive too long.
jmchambers 9 months ago
I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level. Is there any research attempting to do this at the 3D asset level, i.e., subbing in game engine assets (with position and orientation) until a plausible scene is recreated? If it were possible to do it that way, couldn't it "dream" up real maps, with real physics, and so avoid the somewhat noisy output these types of demo generate?
- desdenova 9 months ago
  I think the closest we have right now is 3D gaussian splatting.
  So far it's only been used to train a scene from photographs from multiple angles and rebuild it volumetrically by adjusting densities in a point-cloud.
  But it might be possible to train a model on multiple different scenes, and perform diffusion on a random point cloud to generate new scenes.
  Rendering a point cloud in real time is also very efficient, so it could be used to create insanely realistic game worlds instead of polygonal geometry.
  It seems someone already thought of that: https://ar5iv.labs.arxiv.org/html/2311.11221
  jmchambers 9 months ago
  Interesting, I guess that takes things even further and removes the need for hand-crafted 3D assets altogether, which is probably how things will end up going in gaming, long-term.
  I was suggesting a more modest approach, I guess, one where the reverse-denoising process involves picking and placing existing 3D assets, e.g., those in GTA 5, so that the process is actually building a plausible map, using those 3D assets, but on the fly...
  Turn your car right and a plausible street decorated with buildings, trees and people is dreamt up by the algorithm. All the lighting and physics would still be done in-engine, with stable diffusion acting as a dynamic map creator, with an inherent knowledge of how to decorate a street with a plausible mix of assets.
  I suppose it could form the basis of a procedurally generated game world where, given the same random seed, it could generate whole cities or landscapes that would be the same on each player's machine. Just an idea...
  skydhash 9 months ago
  The thing is that, there are generators that can do exactly this, no need to have an LLM as the middle man. Things like terrain generation, city generation, crowd control, character generation, can be done quite easily with far less compute and energy.
  echelon 9 months ago
  Someone has to write those by hand, and they don't generalize.
  Diffusion based generators will do everything soon. And in every style imaginable.
  We'll probably solve the energy issue in time.
  magicalhippo 9 months ago
  Technically I guess one could do a stable diffusion-like model except on voxels, where instead of pixel intensity values it producing a scalar field which you could turn into geometry using marching cubes or something similar.
  Not sure how efficient that would be though, and would only work for assets like teapots and whatnot, not whole game maps say.
  desdenova 9 months ago
  That's a simplified version of what a point cloud stores, but only works with cubes then.
  A point cloud is basically a 3D texture of colors and densities, so a raymarching algorithm can traverse it adding densities it collides with to find the final fragment color. That's how realistic fog and clouds are rendered in games nowadays, and it's very fast, except they use a noise function instead of a scene model.
  magicalhippo 9 months ago
  > A point cloud is basically a 3D texture of colors and densities
  That's not how I'm familiar with it. As I know it[1], a point cloud is literally that, a collection of individual points, that represents an object scene.
  While what you describe is like the scalar field[2] I mentioned, each position in space has some value. You can render them directly like you say, I was thinking to extract geometry a level-set method could be interesting.
  [1]: https://en.wikipedia.org/wiki/Point_cloud
  [2]: https://en.wikipedia.org/wiki/Scalar_field
- furyofantares 9 months ago
  > but, as far as I know, this is always done at the pixel level
  Image models are NOT denoised at the pixel level - diffusion happens in latent space. This was one of the big breakthroughs that made all of this work well.
  There's a model for encoding/decoding between pixels and latent space. Latent space is able to encode whatever concepts it needs in whichever of its dimensions it needs, and is generally lower dimensional than pixel space. So we get a noisy latent space, denoise it using the diffusion model, then use the other model (variational autoencoder) to decode into pixel space.
- jampekka 9 months ago
  Not exactly 3D assets, but diffusion modems are used to generate e.g. traffic (vehicle trajectories) for evaluating autonomous vehicle algorithms. These vehicles tend to crash quite a lot.
  For example https://github.com/NVlabs/CTG
  Edit: fixed link
- tiborsaas 9 months ago
  Generating this at pixel level is the next level thing. The reverse engineering method your described is probably appealing because it's easier to understand.
  Focusing on pixel level generation is the right approach I think. The somewhat noisy output will be improved upon probably in a short timeframe. Now that they proved with Doom (https://gamengen.github.io/) and this that it's possible, probably more research is happening currently to nail the correct architecture to scale this to HD and minimal hallucination. It happened with videos alredy so we should see a similar level breakthrough soon.
- gliptic 9 months ago
  > I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level.
  It's typically not done at the pixel level, but at the "latent space" level of e.g. a VAE. The image generation is done in this space, which has fewer outputs than the pixels of the final image, and then converted to the pixels using the VAE.
  jmchambers 9 months ago
  Frantically Googles VAE...
  Ah, okay, so the work is done at a different level of abstraction, didn't know that. But I guess it's still a pixel-related abstraction, and it is converted back to pixels to generate the final image?
  I suppose in my proposed (and probably implausible) algorithm, that different level of abstraction might be loosely analogous to collections of related game engine assets that are often used together, so that the denoising algorithm might be effectively saying things like "we'll put some building-related assets here-ish, and some park-related flora assets over here...", and then that gets crystallised in to actual placement of individual assets in the post-processing step.
  StevenWaterman 9 months ago
  (High level, specifics are definitely wrong here)
  The VAE isn't really pixel-level, it's semantic-level. The most significant bits in the encoding are like "how light or dark is the image" and then towards the other end bits represent more niche things like "if it's an image of a person, make them wear glasses". This is way more efficient than using raw pixels because it's so heavily compressed, there's less data. This was one of the big breakthroughs of stable diffusion compared to previous efforts like disco diffusion that work on the pixel level.
  The VAE encodes and decodes images automatically. It's not something that's written, it's trained to understand the semantics of the images in the same way other neural nets are.
- slashdave 9 months ago
  Stable diffusion is in latent space, not by pixel.
cousin_it 9 months ago
I continue to be puzzled by people who don't notice the "noise of hell" in NN pictures and videos. To me it's always recognizable and terrifying, has been from the start.
- npteljes 9 months ago
  What do you mean by noise of hell in particular? I do notice that the images are almost always uncanny in a way, but maybe we're not meaning the same thing. Could you elaborate on what you experience?
- taneq 9 months ago
  Like a subtle but unsettling babble/hubbub/cacophony? If so then I think I kind of know what you mean.
  TechDebtDevin 9 months ago
  There's definately a bit of an uncanny valley in the land of top tier diffusion models. A generative video of someone smiling is way more likely to illicit this response for me than a generative image or single frame. It definately has something to do with the movement.
  cousin_it 9 months ago
  Yes, that's exactly it.
- HKH2 9 months ago
  Eyes have a lot of noise too.
delusional 9 months ago
I just checked it out right quick. It works perfectly well on an AMD card with ROCM pytorch.
It seems decent in short bursts. As it goes on it quite quickly loses detail and the weapon has a tendency to devolve into colorful garbage. I would also like to point out that none of the videos show what happens when you walk into a wall. It doesn't handle it very gracefully.
DrSiemer 9 months ago
Where it gets really interesting is if we can train a model on the latest GTA, plus maybe related real life footage, and then use it to live upgrade the visuals of an old game like Vice City.
The lack of temporal consistency will still make it feel pretty dreamlike, but it won't matter that much, because the base is consistent and it will look amazing.
- InsideOutSanta 9 months ago
  Just redrawing images drawn by an existing game engine works, and generates amazing results, although like you point out, temporal consistency is not great. It might interpret the low-res green pixels on a far-away mountain as fruit trees in one frame, and as pines in the next.
  Here's a demo from 2021 doing something like that: https://www.youtube.com/watch?v=3rYosbwXm1w
- davedx 9 months ago
  A game like GTA has way too much functionality and complex branching for this to work I think (beyond eg doing aimless drives around the city — which would be very cool though)
  DrSiemer 9 months ago
  Gta 5 has everything Vice City has and more. In the Doom AI dream it's possible to shoot people. Maybe in this CS model as well?
  I think the model does not have to know anything about the functionality. It can just dream up what is most probable to happen based on the training data.
- empath75 9 months ago
  People focusing on the use of this in video games baffles me. The point isn't that it can regenerate a videogame world, the point is that it can simulate the _real world_. They're using video game footage to train it because it's cheap and easy to synthesize the data they need. This system doesn't know it's simulating a game. You can give it thousands or millions of hours of real world footage and agent input and get a simulation of the real world.
  Tostino 9 months ago
  Reading this thread has been extremely frustrating because so many commenters seem to be stuck on "why do this when we already have a CS:GO simulation at home?".
  You hit the nail on the head.
- sorenjan 9 months ago
  In addition to the sibling comment's older example there's new work done with GTA too.
  https://www.reddit.com/r/aivideo/comments/1fx6zdr/gta_iv_wit...
  DrSiemer 9 months ago
  Cool! Looks fairly consistent as well.
  I wonder if this type of AI upscaling could eventually also fix things like slightly janky animations, but I guess that would be pretty hard without predetermined input and some form of look ahead.
  Limiting character motion to only allow correct, natural movement would introduce a strange kind of input lag.
  sorenjan 9 months ago
  Nvidia has done work in AI driven character animation too.
  https://www.nvidia.com/en-us/on-demand/session/gtcspring22-d...
- taneq 9 months ago
  Using it as a visual upgrade is pretty close to what DLSS does so that sounds plausible.
- skydhash 9 months ago
  Why not just creating the assets with higher resolution?
  DrSiemer 9 months ago
  Because that is a lot more work, will only work for a single game, potentially requires more resources to run and will not get you the same level of realism.
  skydhash 9 months ago
  Based on what I know about 3D games, I can not imagine that being more efficient than current methods. Especially if you take the 3D world data into account.
mungoman2 9 months ago
This is getting ridiculous!
Curious, since this is a strong loop old frame + input -> new frame, What happens if a non-CS image is used to start it off? Or a map the model has never seen. Will the model play ball, or will it drift back to known CS maps?
- Arch-TK 9 months ago
  Looks like it only knows Dust 2 since every single "dream" (I'm going to call them that since looking at this stuff feels like dreaming about Dust 2) is of that map only.
ilaksh 9 months ago
I wonder if there is some way to combine this with a language model, or somehow have the language model in the same latent space or something.
Is that was vision-language models already do? Somehow all of the language should be grounded in the world model. For models like Gemini that can answer questions about video, it must have some level of this grounding already.
I don't understand how this stuff works, but compressing everything to one dimension as in a language model for processing seems inefficient. The reason our language is serial is because we can only make one sound at a time.
But suppose the "game" trained on was a structural engineering tool. The user asks about some scenario for a structure and somehow that language is converted to an input visualization of the "game state". Maybe some constraints to be solved for are encoded also somehow as part of that initial state.
Then when it's solved (by an agent trained through reinforcement learning that uses each dreamed game state as input?), the result "game state" is converted somehow back into language and combined with the original user query to provide an answer.
But if I understand properly, the biggest utility of this is that there is a network that understands how the world works, and that part of the network can be utilized for predicting useful actions or maybe answering questions etc. ?
- empath75 9 months ago
  Not everything needs to be a single giant neural network. You could have a bunch of weakly coupled specialized networks sending data back and forth over a normal api.
- LarsDu88 9 months ago
  To combine with a language model simply replace the action vector with a language model latent.
  Alternative as of last year there are now purely diffusion based text decoder models
fancyfredbot 9 months ago
Strangely the paper doesn't seem to give much detail on the cs-go example. Actually the paper explicitly mentions it's limited to discrete control environments. Unless I'm missing something the mouse input for counterstrike isn't discrete and wouldn't work.
I'm not sure why the title says it was trained on 2x4090 either as I can't see this on either the linked page or the paper. The paper mentions a GPU year of 4090 compute was used to train the Atari model.
- c1b 9 months ago
  CSGO model is only 1.5 gb & training took 12 days on a 4090
  https://github.com/eloialonso/diamond/tree/csgo?tab=readme-o...
  fancyfredbot 9 months ago
  Thanks, that's the detail I was looking for on the training. It's amazing results like this can be achieved at such a low costs! I thought this kind of work was out of reach for the GPU poor.
  The part about the continuous control still seems weird to me though. If anyone understands that then very interested to hear more.
akomtu 9 months ago
The current batch of ML models looks a lot like filling in holes in the wall of text, drawings or movies: you erase a part of the wall and tell it to fix it. And it fills in the hole using colors from the nearby walls in the kitchen and similar walls and we watch this in awe thinking it must've figured out the design rules of the kitchen. However what it's really done is it interpolated the gaps with some sort of basic functions, trigonometric polynomials for example, and it used thousands of those. This solution wouldn't occur to us because our limited memory isn't enough for thousands of polynomials: we have to find a compact set of rules or give up entirely. So when these ML models predict the motion of planets, they approximate the Newton's law with a long series of basic functions.
ThouYS 9 months ago
I don't really understand the intuition on why this helps RL. The original game has a lot more detail, why can't it be used directly?
- jampekka 9 months ago
  It is used as a predictive model of the environment for model-based RL. I.e. agents can predict consequences of their actions.
  ThouYS 9 months ago
  Oh, I see. I was somehow under the impression that the simulation was the game the RL agent learns to play (which kinda seemed nonsensical).
- visarga 9 months ago
  It can use the game directly but if you try this with real life robots, then it is better to do neural simulation before performing an action that could result in injury or damage. We don't need to fall with our cars off the road many times to learn to drive on the road because we can imagine the consequences. Same thing here.
- FeepingCreature 9 months ago
  In the real world, you can't just boot up a copy of reality to play out strategies. You need an internal model.
  tourmalinetaco 9 months ago
  So, effectively, these video game models are proof-of-concepts to say “we can make models with extremely accurate predictions using minimal resources”?
  usrusr 9 months ago
  Not sure where you see the "minimal resources" here? But I'd just counter all questions about "why" with the blanket response of "for understanding natural intelligence": the way biology innovates is that it throws everything against the wall and not pick the one thing that sticks as the winner and focus on that mechanism, it keeps the sticky bits and also everything else as long as their cost isn't prohibitive. Symbolic modeling ("this is an object that can fall down"), prediction chains based on visual similarity patterns (this), hardwired reflexes (we tend to not trust anything that looks and moves like a spider or snake) and who knows what else, it's all there, it all runs in parallel, invited or not, and they all influence each other in subtle and less subtle ways. The interaction is not engineered, it's more like crosstalk that's allowed to happen and has more upside than downside, or else evolution would have preferred variations of the setup that have less of the kind of crosstalk in question. But in our quest to understand us, it's super exciting to see candidates for processes that perhaps play some role in our minds, in isolation, no matter if that role is big or small.
  vbezhenar 9 months ago
  May be I'm wrong but my understanding is that you can film some area using, say, dashcams and then generate this kind of neuro model. Then you can train robot to walk in this area with this neuro-model. It can perform billions of training sessions without touching physical world. Alternatively you can somehow perform 3D scan of area, recreate its 3D model and use, say, game engine to simulate, but that probably requires more effort and not necessarily better.
  usrusr 9 months ago
  And the leg motions we sometimes see in sleeping dogs suggest that this is very much a way how having dreams is useful!
  empath75 9 months ago
  Yes, this is exactly right. What they need is a giant dataset of agent data and audio and video from real world locomotion.
shahzaibmushtaq 9 months ago
As I used to play CS 1.6 and CS: GO in my free time before the pandemic, this playable CS diffusion world map has been trained by a noob player for research purposes.
After reading the comments I can assume that if you play outside of the scope it was trained on, the game loses its functionality.
Nevertheless, R&D for a good cause is something we all admire.
- crossroadsguy 9 months ago
  How is the last version CS 2.0 (I think)? It’s been free to play like GO I guess. Is it like GO where physics felt too dramatised (could just be my opinion)? Or realistic in a snappy way like 1.6?
  shahzaibmushtaq 9 months ago
  Honestly, I heard about CS 2.0 from you. And you are right what you just said about GO.
  crossroadsguy 9 months ago
  Yeah I also didn't know about it until two days back when I was discussing CS with a college friend when we checked it out we realised there is something new after CSGO as well. And of course "proudly Windows only".
  shahzaibmushtaq 9 months ago
  For Windows and Linux, and still grossed around $2 billion according to various reports and sources
  crossroadsguy 9 months ago
  Yeah. I wish 1.6 worked on Apple Silicon and I wish there was an active scene of that still.
thenthenthen 9 months ago
When my game starts to look like this, I know it is time to quit hahha, maybe a helpful tool in gaming addiction therapy? The morphing of the gun/skins and the environment (the sandbags) wow. Would like to play this and see what happens when you walk backwards, turn around quick, use ‘noclip’ :D
Zealotux 9 months ago
Could we imagine parts of game elements to become "targets" for models? For example hair and fur physics have been notoriously difficult to nail, but it should be easier to use AI to simulate some fake physics on top of the rendered frame, right? Is anyone working on that?
LarsDu88 9 months ago
Iterative denoising diffusion is such a hurdle for getting this sort of thing running at reasonable fps
advael 9 months ago
Dang this is the first paper I've seen in a while that makes me think I need new GPUs
madaxe_again 9 months ago
I earnestly think this is where all gaming will go in the next five years - it’s going to be so compelling that stuff already under development will likely see a shift to using diffusion models. As this is demonstrating, a sufficiently honed model can produce realtime graphics - and some of the demos floating around where people are running GTA San Andreas through non-realtime models hint as to where this will go.
I give it the same five years before there are games entirely indistinguishable from reality, and I don’t just mean graphical fidelity - there’s no reason that the same or another model couldn’t provide limitless physics - bust a hole through that wall, set fire to this refrigerator, whatever.
- qayxc 9 months ago
  I think you're missing the most important point: these models need to be trained on something and that something is a fully developed, working game.
  You're basically saying that game development would need to do the work twice: step 1: develop a fully functional game, step 2: spend ridiculous effort (in terms of time and compute) on training a model to emulate the game in a half-baked fashion.
  It's a solution looking for a problem.
  manmal 9 months ago
  The world model can still be rendered in very low res, and then the diffusion skin/remaster is applied.
  And this would also be an exciting route to go at remastering old games. I‘d pay a lot to play NFS Porsche again, with photorealism. Or imagine Command & Conquer Red Alert, „rendered“ with such a model.
  qayxc 9 months ago
  NVIDIA's RTX Remix [1] suite of tools already does that. It doesn't require any model training or dozens of hours of pre-recorded gameplay either.
  You can drop in low-res textures and have AI tools upscale them. Models can be replaced, as well as lighting and the best part: it's all under your control. You're not at the merci of obscure training material that might or might not result in a consistent look-and-feel. More knobs, more control, less compute required.
  [1] https://www.nvidia.com/en-us/geforce/rtx-remix/
  manmal 9 months ago
  TIL, thanks for posting. The workflow I was sketching out is simpler though: Render a legacy game or low fidelity modern game as-is, and run it through a diffusion model in real time.
  empath75 9 months ago
  No, you use it to simulate things that we don't have efficient perfect models of -- like the actual world. Everyone is correct that using this to simulate counterstrike is pointless. This is not video game technology, this is autonomous agent technology -- training robots to predict and navigate the real world.
  FeepingCreature 9 months ago
  You can crosstrain on reality.
  qayxc 9 months ago
  Sure - and that'll work great on titles like Ratchet & Clank [1] or Tiny Tina's Wonderland [2] because art and style are dead and everything must be a mirror-like reflection of reality in order to be a fun game...
  [1] https://en.wikipedia.org/wiki/Ratchet_%26_Clank
  [2]https://playwonderlands.2k.com
  FeepingCreature 9 months ago
  The network learns high-level abstractions, not just style. Crosstraining on reality will make it better at physics, logic, buildings, expressions, character animations etc. - even in a cel shading or topdown style.
  It's the thing Miyazaki is on about with his famous quote.
  undefined 9 months ago
  [deleted]
- casenmgreen 9 months ago
  Not a chance.
  There are fundamental limitations with what are in the end all essentially neural nets; there is no understanding, only prediction. Prediction alone is not enough to emulate reality, which is why for example genuinely self-driving cars have not, and will not, emerge. A fundamental advance in AI technology will be required for that, something which leads to genuine intelligence, and we are no closer to that than ever we were.
  fancyfredbot 9 months ago
  Looking at the examples of 2600 games in the paper I'm not sure you can tell that they are just predictions.
  Have you considered how you'd tell the difference between a prediction and understanding in practice?
  jiggawatts 9 months ago
  For simulations like games, it's a trivial matter to feed the neural game engine pixel-perfect metadata.
  Instead of rendering the final shaded and textured pixels, the engine would output just the material IDs, motion vectors, and similar "meta" data that would normally be the inputs into a real-time shader.
  The AI can use this as inputs to render a photorealistic output. It can be trained using offline-rendered "ground-truth" raytraced scenes. Potentially, video labelled in a similar way could be used to give it a flair of realism.
  This is already what NVIDIA DLSS and similar AI upscaling tech uses. The obvious next step is not just to upscale rendered scenes, but to do the rendering itself.
  madaxe_again 9 months ago
  Yet we have no understanding, only prediction. We can describe a great many things in detail, how they interact - and we can claim to understand things, yet if you recursively ask “why?” everybody, and I mean everybody, will reach a point where they say “I don’t know” or “god”.
  An incomplete understanding is no understanding at all, and I would argue that we can only predict, and we can certainly emulate reality, otherwise we would not be able to function within it. A toddler can emulate reality, anticipate causality - and they certainly can’t be said to be in possession of a robust grand unified theory.
  esafak 9 months ago
  Ilya Sutskever, among others, believe that better prediction is achieved through understanding, and that models can be learn to understand. As we have. We are not born understanding everything.
  francoisfleuret 9 months ago
  "there is no understanding, only prediction"
  I have no idea what this means.
  nonrandomstring 9 months ago
  > > "there is no understanding, only prediction"
  > I have no idea what this means.
  You can throw a ball up in the air and predict that it will fall again and bounce. You have no understanding of mass, gravity, acceleration, momentum, impulse, elasticity...
  You can press a button that makes an Uber car appear in reality and take you home. You have no understanding of apps, operating systems, radio, internet, roads, wheels, internal combustion engines, driving, GPS, maps...
  This confusion of understanding and prediction affects a lot of people who use technology in a "machine-like" way, purely instrumental and utilitarian... "how does this get me what I want immediately?"
  You can take any complex reality and deflate it, abstract it, reduce it down to a mere set of predictions that preserve all the utility for a narrow task (in this case visual facsimile) but strip away all depth of meaning. The models, of both the system and the internal working model of the user are flattened. In this sense "AI" is probably the greatest assault on actual knowledge since the book burning under totalitarian regimes of the mid 20th century.
  GaggiX 9 months ago
  What if the model actually understands that the ball will fall and bounce because of mass, gravity, acceleration, momentum, impulse, elasticity? I mean you can just ask ChatGPT and Claude, I guess you would answer that in this case it's just prediction, but if they were human then it would be understanding.
  nonrandomstring 9 months ago
  > I guess you would answer that in this case it's just prediction,
  No I would answer that it is indeed understanding, to upend your "guess" (prediction) and so prove that while you think you can "predict" the next answer you lack understanding of what the argument is really about :)
  GaggiX 9 months ago
  I think I understand the topic quite well, since you deliberately deviate from answering the question. You made a practical example that doesn't really work in practice.
  binary132 9 months ago
  I think GP is saying that understanding is measured by predictive capability of the theory
  and in case you hadn’t noticed, that kind of uncomprehending slopthink has been going on for a lot longer than the AI fad
  tourmalinetaco 9 months ago
  The MLM has no idea what it’s making, where you are in the map, what you left behind, and what you picked up. It can accurately predict what comes next, but if you pick up an item and do a 360° turn the item will be back and you can repeat the process.
  GaggiX 9 months ago
  When a human does it, it's understanding, when an AI does it, it's prediction, I thinks it's very clear /s
  therouwboat 9 months ago
  Does what? In normal game world things tend to stay where they are without player having to do anything.
  GaggiX 9 months ago
  We are talking about neural networks in general, not this one or that one, if you train a bad model or the model is untrained it would not indeed understand much or anything.
  killerstorm 9 months ago
  That's bs. You have no understanding of understanding.
  Hooke's law was pure curve-fitting. Hooke definitely did not understand the "why". And yet we don't consider that bad physics.
  Newton's laws can be derived from curve fitting. How is that different from "understanding"?
  madaxe_again 9 months ago
  Einstein couldn’t even explain why general relativity occurred. Sure, spacetime is curved by mass, but why? What a loser.
  killerstorm 9 months ago
  It's very illustrative to look into the history of discovery of laws of motion, as it's quite well documented.
  People have an intuitive understanding of motion - we see it literally every day, we throw objects, etc.
  And yet it took literally thousands of years since discovery of mathematics (geometry, etc.) to formulate a concept of force, momentum, etc.
  Ancient Greek mathematicians could do integration, so they were not lacking mathematical sophistication. And yet their understanding of motion was so primitive:
  Aristotle, an extremely smart man, was muttering something about "violent" and "natural" motion: https://en.wikipedia.org/wiki/Newton%27s_laws_of_motion#Anti...
  People started to understand the conservation of quantity of motion only in 17th century.
  So we have two possibilities:
  * everyone until 17th century was dumb af (despite being able to do quite impressive calculations)
  * scientific discovery is really a heuristic-driven search process where people try various things until they find a good fit
  I.e. millions of people were somehow failing to understand motion for literally thousands of years until they collected enough assertions about motion that they were able to formulate the rule of conservation, test it, and confirm it fits. And only then it became understanding.
  You can literally see conservation of momentum on a billiard table: you "violently" hit one ball, it hits other balls and they start to move, but slower, etc. So you really transfer something from one ball to the rest. And yet people could not see it for thousands of years.
  What this shows is that there's nothing fundamental about understanding: it's just a sense of familiarity, it is a sense that your model fits well. Under the hood it's all prediction and curve fitting.
  We literally have prediction hardware in our brains: cerebellum has specialized cells which can predict, e.g. motion. So people with damaged cerebellum have impaired movement: they still can move, but their movement are not precise. When do you think we find specialized understanding cells in the human brain?
  mrob 9 months ago
  It seems to me that your evidence supports the exact opposite of your conclusion. Familiarity was only enough to find ad-hoc heuristics for specific situations. It let us discover intuitive methods to throw stones, drive carts, play ball games, etc. but never discovered the general principle behind them. A skilled archer does not automatically know that the same rules can be used to aim a mortar.
  Ad-hoc heuristics are not the same thing as understanding. It took formal reasoning for humans to actually understand motion, of a type that modern AI does not use. There is something fundamental about understanding that no amount of familiarity can substitute for. Modern AI can gain enormous amounts of familiarity but still fail to understand, e.g. this Counter-Strike simulator not knowing what happens when the player walks into a wall.
  killerstorm 9 months ago
  People found that `m * v` is the quantity which is conserved.
  There's no understanding. It's just a formula which matches the observations. It also matches our intuition (a heavier object is hard to move, etc), and you feel this connection as understanding.
  Centuries later people found that conservation laws are linked to symmetries. But again, it's not some fundamental truth, it's just a link between two concepts.
  LLM can link two concepts too. So why do you believe that LLM cannot understand?
  I middle school I did extremely well in physics classes - I could solve complex problems which my classmates couldn't because I could visualize the physical process (e.g. motion of an object) and link that to formulas. This means I understood it, right?
  Years later I thought "But what *is* motion, fundamentally?". I grabbed Landau-Lifshitz mechanics textbook. How do they define motion? Apparently, bodies move in a way to minimize some integral. They can derive the rest from it. But it doesn't explain what a motion is. Some of the best physicists in the world cannot define it.
  So I don't think there's anything to understanding except feeling of connection between different things. "X is like Y except for Z".
  mrob 9 months ago
  Understanding is finding the simplest general solution. Newton's laws are understanding. Catching the ball is not. LLMs take billions of parameters to do anything and don't even generalize well. That's obviously not understanding.
  killerstorm 9 months ago
  You're confusing two meanings of world "understanding":
  1. Finding a comprehensive explanation
  2. Having a comprehensive explanation which is usable
  99.999% people on Earth do not discover any new laws, so I don't think you use #1 as a fundamental deficiency of LLMs.
  And nobody is saying that just training a LLM produces understanding of new phenomena. That's a strawman.
  The thesis is that a more powerful LLM together with more software, more models, etc, can potentially discover something new. That's not observed yet. But I'd say it would be weird if LLM can match capabilities of average folk but never match Newton. It's not like Newton's brain is fundamentally different.
  Also worth noting that formulas can be discovered by enumeration. E.g. `m * v` should not be particularly hard to discover. And the fact that it took people centuries implies that that's what happened: people tried different formulas until they found one which works. It doesn't have to be some fancy Newton magic.
  mrob 9 months ago
  I'm certain that people did not spend centuries trying different formulas for the laws of motion before finding one that worked. The crucial insight was applying any formula at all. Once you have that then the rest is relatively easy. I don't see LLMs making that kind of discovery.
  undefined 9 months ago
  [deleted]
- TinkersW 9 months ago
  It requires a monster GPU to run 10 fps at what looks like sub 720p... I think it may be abit more than 5 years..
- viraptor 9 months ago
  It's not that great yet.
  Given a model which can generate the game view in ~real time and a model which can generate the models and textures, why would you ever use the first option, apart from a cool tech demo? I'm sure there's space for new dreamy games where invisible space behind you transforms when you turn around, but for other genres... why? Destructible environment has been possible for quite a while, but once you allow that everywhere, you can get games into unplayable state. They need to be designed around that mechanic to work well: Noita, Worms, Teardown, etc. I don't believe the "limitless physics" would matter after a few minutes.
- Arch485 9 months ago
  It seems extremely unlikely to me that ML models will ever run entire games. Nobody wants a game that's "entirely indistinguishable from reality" anyways. If they did, they would go outside.
  I think it's possible specific engine components could be ML-driven in the future, like graphics or NPC interactions. This is already happening to a certain degree.
  Now, I don't think it's impossible for an ML model to run an entire game. I just don't think making + running your game in a predictive ML model will ever be more effective than making a game the normal way.
  jsheard 9 months ago
  Yep, the fuzziness and opaqueness of ML models makes developing an entire game state inside one a non-starter in my opinion. You need precise rules, and you need to be able to iterate on those rules quickly, neither of which are feasible with our current understanding of ML models. Nobody wants a version of CS:GO where fundamental constants like weapon damage run on dream logic.
  If ML has any place in games it's for specific subsystems which don't need absolute precision, NPC behaviour, character animation, refining the output of a renderer, that kind of thing.
- advael 9 months ago
  I'm not sure that's a warranted assumption based on this result, exciting as it is, we are still seeing replication of an extant testable world model, rather than extrapolation that can produce novel mechanics without them being in the training data. I'm not saying this isn't a stepping stone to that, I just think your prediction's a little optimistic based on the scope of that problem
undefined 9 months ago
[deleted]
gadders 9 months ago
Cool achievement, but I want AI to give me smarter NPCs, not simulate the map.
- thelastparadise 9 months ago
  The NPCs need a model of the world in their brain in order to act normal.
iwontberude 9 months ago
This is crazy looking, I know it’s basically useless but it’s cool anyways.
styfle 9 months ago
But does it work on macOS?
(The latest CS removed support for macOS)
6510 9 months ago
Can it use a seed that makes the same map every time?
mixtureoftakes 9 months ago
this is crazy
when trying to run on a mac it only plays in a very small window, how could this be configured?
snickerer 9 months ago
I see where this is going.
The next step to create training data is a real human with a bodycam. There is only the need to connect the real body movement (step forward, turning left, etc) to typical keyboard and mouse game control events, to feed them into the model, too.
I think that is what the devs here are dreaming about.
- CaptainFever 9 months ago
  Or a cockpit cam for the world's most realistic flight simulator. /lighthearted
- devttyeu 9 months ago
  The "We live in a simulation" argument just started looking a lot more conceivable.
  tiborsaas 9 months ago
  I'm already very suspicious, we just got the same room number in the third hotel in a row. Someone got lazy with the details :)
  iwontberude 9 months ago
  Not really because is it simulators all the way down? Simulation theory explains nothing and only adds more unexplainable phenomenon.
- devttyeu 9 months ago
  Could probably make a decent dataset from VR headset tracking cameras + motion sensors + passthrough output + decoded hand movements
TealMyEal 9 months ago
What's the end goal here? personalised games for everyone. ultra-graphics. i dont really see how this is going to be better than our engine based systems.
I love being a horse in the 1900s that automobilewill never take off /s
- visarga 9 months ago
  The goal is to train agents that can imagine consequences before acting. But it could also become a cheap way to create experiences and user interfaces on the fly, imagine if you can have any app UI dreamed up like that, not just games. Generative visual interfaces could be a big leap over text mode.
- qayxc 9 months ago
  It's a research paper. Not everything that comes out of research has an immediate real-world application in mind.
  Games are just an accessible and easy to replicate context to work in, they're not an end goal or target application.
  The research is about AI agents interacting with and creating world models. Such world models could just as well be alien environments - i.e. the kind of stuff an interstellar and even interplanetary probe would need to be able to do, as two-way communication over large distances is impractical.
yakorevivan 9 months ago
[dead]
siev 9 months ago
[dead]
suggeststrongid 9 months ago
[dead]
w-m 9 months ago
If you're not bored with it yet, here's a Deep Dive (NotebookLM, generated podcast). I fed it the project page, the arXiv paper, the GitHub page, and the two twitter threads by the authors.
https://notebooklm.google.com/notebook/a240cb12-8ca1-41b4-ab... (7m59s)
As always, it's not actually much of a technical deep dive, but gives a quite decent overview of the pieces involved, and its applications.
- thierrydamiba 9 months ago
  How did you get the output to be so long? My podcasts are 3 mins max…
  w-m 9 months ago
  Oh wow, really? Even if you feed it whole research papers? The ones I tried until now were more in the 8-10 minute range. I haven’t looked in to how to control the output yet. Hopefully that’ll get a little more transparent and controllable soon.