Just some observations from an ex-autonomous-robotics researcher here.
One of the most important differences, at least in those days (the '80s and '90s), was time. While the digital world can be sped up, constrained only by the speed of your compute, the physical world is very much constrained by real-time physics. You can't speed up a robot 10x in a 10,000-trial grabbing-and-stacking learning run without completely changing the dynamics.
Also, parallelizing the work requires more expensive full robots rather than more compute cores. Maybe these days the various AI-gym-like virtual physics environments offer a (partial) solution to that problem, but I have not used them (yet) so I can't tell.
Furthermore, large-scale physical robots are far more fragile, due to wear and tear, than modern compute hardware, which is remarkably resilient. Getting a perfect copy of a physical robot and its environment is a very hard, near-impossible task.
Observability and replay, while trivial in the digital world, are very limited in a physical environment, making analysis much more difficult.
I was both excited and frustrated at the time by making AI do more than rearranging pixels on a 2D surface. Good times were had.
I find it odd that the article doesn't address the apparent success of training transformer-based models in virtual environments to build models that are then mapped onto the real world. This is being used in everything from building datasets for self-driving cars to navigation and task completion for humanoid robots. Nvidia has its Omniverse project [1], but there are countless other examples [2][3][4]. Isn't this obviously the way to build the corpus of experience needed to train these kinds of cross-modal models?
[1] https://www.nvidia.com/en-us/industries/robotics/#:~:text=NV....
[2] https://www.sciencedirect.com/science/article/abs/pii/S00978...
[3] https://techcrunch.com/2024/01/04/google-outlines-new-method...
[4] https://techxplore.com/news/2024-09-google-deepmind-unveils-...
> Robots are probably amazed by our ability to keep a food tray steady, the same way we are amazed by spider-senses (from spiderman movie)
Funnily enough, Tobey Maguire actually did that tray-catching stunt for real. So robots have even further to go.
https://screenrant.com/spiderman-sam-raimi-peter-parker-tray...
I think Moravec's Paradox is often misapplied when considering LLMs vs. robotics. It's true that formal reasoning over unambiguous problem representations is easy and computationally cheap. Lisp machines were already doing this sort of thing in the '70s. But the kind of commonsense reasoning over ambiguous natural language that LLMs can do is not easy or computationally cheap. Many early AI researchers thought it would be — that it would just require a bit of elaboration on the formal reasoning stuff — but this was totally wrong.
So, it doesn't make sense to say that what LLMs do is Moravec-easy, and therefore can't be extrapolated to predict near-term progress on Moravec-hard problems like robotics. What LLMs do is, in fact, Moravec-hard. And we should expect that if we've got enough compute to make major progress on one Moravec-hard problem, there's a good chance we're closing in on having enough to make major progress on others.
Leaving aside the lack of consensus around whether LLMs actually succeed in commonsense reasoning, this seems a little bit like saying “Actually, the first 90% of our project took an enormous amount of time, so it must be ‘Pareto-hard’. And thus the last 10% is well within reach!” That is, that Pareto and Moravec were in fact just wrong, and thing A and thing B are equivalently hard.
Keeping the paradox would more logically bring you to the conclusion that LLMs’ massive computational needs and limited capacities imply a commensurately greater, mind-bogglingly large computational requirement for physical aptitude.
Good points. Came here to say pretty much the same.
Moravec's Paradox is certainly interesting and correct if you limit its scope (as you say). But it feels intuitively wrong to me to make any claims about the relative computational demands of sensorimotor control and abstract thinking before we've really solved either problem.
Looking, for example, at the recent progress on ARC-AGI, my impression is that abstract thought could have incredible computational demands. IIRC they had to throw approximately $10k of compute at o3 before it reached human performance. Now compare how cognitively challenging ARC-AGI is to, say, designing or reorganizing a Tesla gigafactory.
With that said, I do agree that our culture tends to value simple office work over skillful practical work. Hopefully the progress in AI/ML will soon right that wrong.
> Moravec’s paradox is the observation by artificial intelligence and robotics researchers that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated by Hans Moravec, Rodney Brooks, Marvin Minsky, and others in the 1980s.
I have a name for it now!
I've said over and over that there are only two really hard problems in robotics: Perception and funding. A perfectly perceived system and world can be trivially planned for and (at least proprio-)controlled. Imagine having a perfect intuition about other actors such that you know their paths (in self driving cars), or your map is a perfect voxel + trajectory + classification. How divine!
It's limited information and difficulties in reducing signal to concise representation that always get ya. This is why the perfect lab demos always fail - there's a corner case not in your training data, or the sensor stuttered or became misaligned, or etc etc.
> I've said over and over that there are only two really hard problems in robotics: Perception and funding. A perfectly perceived system and world can be trivially planned for and (at least proprio-)controlled.
Funding for sure. :)
But as for perception, the inverse is also true. If I have a perfect planning/prediction system, I can throw the grungiest, worst perception data into it and it will still plan successfully despite tons of uncertainty.
And therein lies the real challenge of robotics: it's fundamentally a systems engineering problem. You will never have perfect perception or a perfect planner. So, can you make a perception system that is good enough that, when coupled with a planning system that is good enough, you are able to solve enough problems with enough 9s to make it successful?
The most commercially successful robots I've seen have had some of the smartest systems engineering behind them, such that entire classes of failures were eliminated by being smarter about what you actually need to do to solve the problem and aggressively avoid solving subproblems that aren't absolutely necessary. Only then do you really have a hope of getting good enough at that focused domain to ship something before the money runs out. :)
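To put toy numbers on the "enough 9s" point (figures invented purely for illustration): per-step success rates multiply, so two subsystems that each look fine in isolation can still sink a long task, which is exactly why trimming unnecessary subproblems pays off.

    # Illustrative only: made-up reliabilities showing why chaining "good enough"
    # subsystems still demands careful systems engineering.
    perception = 0.99     # probability a single perception step is correct
    planning   = 0.99     # probability a single planning step is correct
    steps      = 20       # perception+planning steps in one end-to-end task

    per_step  = perception * planning
    task_rate = per_step ** steps
    print(f"per-step success: {per_step:.4f}, full-task success: {task_rate:.2%}")
    # -> about 0.9801 per step, but only roughly 67% over 20 chained steps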
> being smarter about what you actually need to do to solve the problem and aggressively avoid solving subproblems that aren't absolutely necessary
I feel like this is true for every engineering discipline or maybe even every field that needs to operate in the real world
except software, of course. Nowadays it seems that software is all about creating problems to create solutions for.
Maybe just semantics, but I think I would call that prediction. Even if you have perfect perception (measuring the current state of the world perfectly), it's nontrivial to predict the future paths of other actors. The prediction problem requires intuition about what the other actors are thinking, how their plans influence each other, and how your plan influences them.
> Moravec hypothesized around his paradox, that the reason for the paradox [that things we perceive as easy b/c we dont think about them are actually hard] could be due to the sensor & motor portion of the human brain having had billions of years of experience and natural selection to fine-tune it, while abstract thoughts have had maybe 100 thousand years or less
Another gem!
> ...the sensor & motor portion of the human brain having had billions of years of experience.
It doesn't really change the significance of the quote, but I can't help but point out that we didn't even have nerve cells more than 0.6 billion years ago.
Or it could be a parallel vs serial compute thing.
Perception tasks involve relatively simple operations across very large amounts of data, which is very easy if you have a lot of parallel processors.
Abstract thought is mostly a serial task, applying very complex operations to a small amount of data. Many abstract tasks, like evaluating logical circuits, cannot be done effectively in parallel; the circuit value problem is P-complete.
Your brain is mostly a parallel processor (80 billion neurons operating asynchronously), so logical reasoning is hard and perception is easy. Your CPU is mostly a serial processor, so logical reasoning is easy and perception is hard.
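A toy way to see that asymmetry (a sketch, with NumPy standing in for the parallel processor): a perception-style operation touches a million values, but every value is independent, whereas a chained logic evaluation forces one step after another no matter how many cores you have.

    import numpy as np

    # "Perception": one simple operation over a million pixels.
    # Every pixel is independent, so this maps perfectly onto parallel hardware.
    image = np.random.rand(1000, 1000)
    bright = image > 0.5                      # embarrassingly parallel

    # "Abstract thought": a chain of logic gates where each gate's input is the
    # previous gate's output (a toy stand-in for the P-complete circuit value
    # problem). Extra processors don't let you skip ahead in the chain.
    gates = [("and", True), ("or", False), ("and", True)] * 1000
    value = True
    for op, operand in gates:
        value = (value and operand) if op == "and" else (value or operand)

    print(int(bright.sum()), value)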
> Perception tasks involve relatively simple operations across very large amounts of data, which is very easy if you have a lot of parallel processors.
Yes, relatively simple. Wait, isn't that exactly what the article explained was completely wrong-headed?
No. The article is talking about things we think of as being easy because they are easy for a human to perform but that are actually very difficult to formalize/reproduce artificially.
The person you are responding to is instead comparing differences in biological systems and mechanical systems.
Yeah, the fun way Moravec's paradox was explained to me [1] is that you can now easily get a computer to solve the simultaneous differential equations governing all the axes of motion of a robot arm, but getting it to pick one screw out of a box of screws is an unsolved research problem.
[1] By a disillusioned computer-vision PhD who left the field in the 1990s.
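For the "easy" half of that contrast, here's a minimal sketch (heavily simplified dynamics, hand-picked gains, unit inertia, nothing like a real arm) of numerically solving the coupled ODEs for a two-joint arm driven toward a target pose. The screw-picking half has no comparably short sketch.

    import numpy as np
    from scipy.integrate import solve_ivp

    TARGET = np.array([1.2, -0.6])      # desired joint angles (rad), arbitrary
    KP, KD = 20.0, 5.0                  # PD controller gains, chosen by hand

    def dynamics(t, state):
        theta, omega = state[:2], state[2:]
        torque = KP * (TARGET - theta) - KD * omega   # PD control per joint
        return np.concatenate([omega, torque])        # unit inertia, no coupling

    sol = solve_ivp(dynamics, (0.0, 3.0), y0=[0.0, 0.0, 0.0, 0.0], max_step=0.01)
    print("final joint angles:", sol.y[:2, -1])       # settles near TARGET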
Selective attention was one of the main factors in Hubert Dreyfus' explanation of "what computers can't do." He had a special term for it, which I can't remember off-hand.
"the sensor stuttered or became misaligned, or etc etc."
if your eyes suddenly crossed, you'd probably fall over too!
It's worth noting that modern multimodal models are not confused by the cat image. For example, Claude 3.5 Sonnet says:
> This image shows two cats cuddling or sleeping together on what appears to be a blue fabric surface, possibly a blanket or bedspread. One cat appears to be black while the other is white with pink ears. They're lying close together, suggesting they're comfortable with each other. The composition is quite sweet and peaceful, capturing a tender moment between these feline companions.
Also Claude, when given the entire picture:
"This is a humorous post showcasing an AI image recognition system making an amusing mistake. The neural network (named "neural net guesses memes") attempted to classify an image with 99.52% confidence that it shows a skunk. However, the image actually shows two cats lying together - one black and one white - whose coloring and positioning resembles the distinctive black and white pattern of a skunk.
The humor comes from the fact that while the AI was very confident (99.52%) in its prediction, it was completely wrong..."
The progress we made in barely ten years is astounding.
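(If you want to run the same check yourself, a minimal sketch with Anthropic's Python SDK looks roughly like the below; the image path is a placeholder and the model name may need updating.)

    import base64
    import anthropic

    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

    with open("cats.jpg", "rb") as f:            # placeholder image path
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/jpeg",
                                             "data": image_data}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }],
    )
    print(message.content[0].text)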
It's easy to make something work when the example goes from being out of the training data to into the training data.
Definitely. But I also tried with a picture of an absurdist cartoon drawn by a family member, complete with (carefully) handwritten text, and the analysis was absolutely perfect.
Question:
Isn’t it fundamentally impossible to model a highly entropic system using deterministic methods?
My point is that animal brains are entropic and "designed" to model entropic systems, whereas computers are deterministic and actively have to have problems reframed as deterministic so that they can solve them.
All of the issues mentioned in the article boil down to the fundamental problem of trying to get deterministic systems to function in highly entropic environments.
LLMs are working with language, which has some entropy but is fundamentally a low entropy system, and has orders of magnitude less entropy than most peoples’ back garden!
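As a very crude illustration of the "language is low-entropy" claim (a toy byte-level measurement, not a serious argument): English text carries far fewer bits per byte than raw noise, which stands in here for unstructured sensor data.

    import math
    import os
    from collections import Counter

    def shannon_entropy(data: bytes) -> float:
        # Average bits per byte, estimated from the byte histogram.
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    text = ("the quick brown fox jumps over the lazy dog " * 100).encode()
    noise = os.urandom(len(text))    # stand-in for raw sensor readings

    print(f"text : {shannon_entropy(text):.2f} bits/byte")
    print(f"noise: {shannon_entropy(noise):.2f} bits/byte")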
As the saying goes, to someone with a hammer, everything looks like a nail.
So I'm old. PhD on search engines in the early 1990s (yep, early '90s). Learnt AI in the dark days of the '80s. So, there is an awful lot of forgetting going on, largely driven by the publish-or-perish culture we have. Brooks' subsumption architecture was not perfect, but it outlined an approach that philosophy and others have been championing for decades. He said he was not implementing Heidegger, just doing engineering, but Brooks was certainly channeling Heidegger's successors. Subsumption might not scale, but perhaps that is where ML comes in.
On a related point, "generative AI" does sequences (it's glorified auto-complete (not), according to Hinton in the New Yorker). Data is given to a Tokeniser that produces a sequence of tokens, and the "AI" predicts what comes next. Cool. Robots are agents in an environment with an Umwelt. Robotics is pre the Tokeniser. What is it that is recognisable and sequential in the world? 2 cents please.
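For readers who never met subsumption, here's a toy sketch of the layered idea mentioned above (behaviour names and structure invented for illustration; Brooks' actual architecture wired up augmented finite state machines, not Python functions): behaviours are checked from highest priority down, and the first one that fires suppresses everything below it.

    def avoid_obstacle(sensors):
        # Highest-priority layer: reflex-like collision avoidance.
        if sensors["bump"]:
            return "back_up_and_turn"
        return None

    def wander(sensors):
        # Lowest-priority layer: default exploratory behaviour.
        return "drive_forward_with_random_turns"

    LAYERS = [avoid_obstacle, wander]      # highest priority first

    def control_step(sensors):
        for behaviour in LAYERS:
            action = behaviour(sensors)
            if action is not None:
                return action              # higher layer subsumes the rest

    print(control_step({"bump": False}))   # -> wander
    print(control_step({"bump": True}))    # -> avoidance takes over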
I’m surprised this doesn’t place more emphasis on self-supervised learning through exploration. Is human-labeled datasets really the SOTA approach for robotics?
I would love to see some numbers. How many orders of magnitude more complicated do we think embodiment is, compared to conversation? How much data do we need compared to what we’ve already collected?
"Hardness" is a difficult quantity to define if you venture beyond "humans have been trying to build systems to do this for a while, and haven't succeeded".
Insects have succeeded in building precision systems that combine vision, smell, touch, and a few other senses. I doubt finding a juicy spider and immobilising it is that much more difficult than finding a door knob and turning it, or folding a T-shirt. Yet insects accomplish it with, I suspect, far less compute than modern LLMs. So it's not "hard" in the sense of requiring huge compute resources, and certainly not a lot of power.
So it's probably not that hard in the sense that it's well within the capabilities of the hardware we have now. The issue is more that we don't have a clue how to do it.
If nature computed both through evolution, then maybe it's approximately the same ratio: roughly the time it took to evolve embodiment versus roughly the time it took to evolve from grunts to advanced language.
If we start from when we think multicellular life first evolved (~2b years ago), or maybe the Cambrian explosion (~500m years ago), and go until modern humans (~300k years ago), then compare that to the time between the first modern humans and now.
It seems like maybe 3-4 orders of magnitude harder.
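Spelling out that back-of-the-envelope arithmetic (all timescales rough):

    import math

    multicellular = 2e9     # years since multicellular life, roughly
    cambrian      = 5e8     # years since the Cambrian explosion, roughly
    modern_humans = 3e5     # years since anatomically modern humans, roughly

    for start in (multicellular, cambrian):
        ratio = start / modern_humans
        print(f"{ratio:,.0f}x (~{math.log10(ratio):.1f} orders of magnitude)")
    # -> roughly 1,700x to 6,700x, i.e. about 3-4 orders of magnitude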
My intuition after reading the articles is that there needs to be way more sensors all throughout the robot, probably with lots of redundancies, and then lots of modern LLM sized models all dedicated to specific joints and functions and capable of cascading judgement between each other, similar to how our nervous system works.
I feel more tired after driving all day than reading all day.
The reason why it sounds counterintuitive is that neurology has the brain upside down. It teaches us that formal thinking occurs in the neocortex, and we need all that huge brain mass for that.
But in fact it works like an autoencoder, and it reduces sensory inputs into a much smaller latent space, or something very similar to that. This does result in holistic and abstract thinking, but formal analytical thinking doesn't require abstraction to do the math or to follow a method without comprehension. It's a concrete approach that avoids the need for abstraction.
The cerebellum is the statistical machine that gets measured by IQ and other tests.
To further support that, you don't see any particularly elegant motion from non-mammal animals. In fact everything else looks quite clumsy, and even birds need to figure out flying by trial and error.
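To pin down what the autoencoder analogy above means mechanically, here's a minimal sketch (dimensions arbitrary, PyTorch assumed): the comment's analogy made concrete, not a claim about the brain.

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, sensory_dim=10_000, latent_dim=32):
            super().__init__()
            # Squeeze a large "sensory" vector into a much smaller latent code.
            self.encoder = nn.Sequential(nn.Linear(sensory_dim, 256), nn.ReLU(),
                                         nn.Linear(256, latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                         nn.Linear(256, sensory_dim))

        def forward(self, x):
            z = self.encoder(x)              # compressed latent representation
            return self.decoder(z), z

    model = AutoEncoder()
    x = torch.randn(1, 10_000)               # stand-in for raw sensory input
    reconstruction, latent = model(x)
    print(latent.shape)                      # torch.Size([1, 32])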
Honestly I'm tired of people who are more focused on 'debunking the hype' than figuring out how to make things work.
Yes, robotics is hard, and it's still hard despite big breakthroughs in other parts of AI like computer vision and NLP. But deep learning is still the most promising avenue for general-purpose robots, and it's hard to imagine a way to handle the open-ended complexity of the real world other than learning.
Just let them cook.
As someone on the sidelines of robotics who generally feels everything getting disrupted and at the precipice of major change, it's really helpful to have a clearer understanding of the actual challenge and how close we are to solving it. Anything that helps me make more accurate predictions will help me make better decisions about what problems I should be trying to solve and what skills I should be trying to develop.
> If you want a more technical, serious (better) post with a solution oriented point to make, I’ll refer you to Eric Jang’s post [1]
Yeah, this was my general impression after a brief, disastrous stretch in robotics after my PhD. Hell, I work in animation now, which is a way easier problem since there are no physical constraints, and we still can’t solve a lot of the problems the OP brings up.
Even stuff like using video misses the point, because so much of our experience is via touch.
Yesterday, I was watching some of the YouTube videos on the website of a robotics company, https://www.figure.ai, and they challenge some of the points in this article a bit.
They have a nice robot prototype that (assuming these demos aren't faked) does fairly complicated things. And one of the key features they showcase is using OpenAI's AI for the human-computer interaction and reasoning.
While these things seem a bit slow, they do get things done. They have a cool demo of a human interacting with one of the prototypes, asking it what it thinks needs to be done and then asking it to do those things. That showcases reasoning, planning, and machine vision, which are exactly the topics that all the big LLM companies are working on.
They appear to be using an agentic approach similar to how LLMs are currently being integrated into other software products. Honestly, it doesn't even look like they are doing much that isn't part of OpenAI's APIs. Which is impressive. I saw speech capabilities, reasoning, visual inputs, function calls, etc. in action. Including the dreaded "thinking" pause where the Robot waits a few seconds for the remote GPUs to do their thing.
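To make "function calls" concrete: the pattern presumably looks roughly like the standard OpenAI tools API below (the robot primitive and its schema are entirely hypothetical; I have no idea what figure.ai's actual interface is).

    from openai import OpenAI

    client = OpenAI()   # reads OPENAI_API_KEY from the environment

    tools = [{
        "type": "function",
        "function": {
            "name": "move_gripper_to",            # hypothetical robot primitive
            "description": "Move the gripper to a named, pre-mapped location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Put the cup in the coffee machine."}],
        tools=tools,
    )

    # The model replies with structured tool calls for the robot runtime to execute.
    for call in response.choices[0].message.tool_calls or []:
        print(call.function.name, call.function.arguments)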
This is not about fine motor control but about replacing humans controlling robots with LLMs controlling robots and getting similarly good/ok results. As the article argues, the hardware is actually not perfect but good enough for a lot of tasks if it is controlled by a human. The hardware in this video is nothing special. Multiple companies have similar or better prototypes. Dexterity and balance are alright but probably not best in class. Best in class hardware is not the point of these demos.
Dexterity and real-time feedback are less important than the reasoning and classification capabilities people have. The latency just means things go a bit slower. Watching these things shuffle around like an old person who needs to go to the bathroom is a bit painful. But getting from A to B seems like a solved problem. A 2x or 3x speedup would be nice. 10x would be impressively fast. 100x would be scary and intimidating to have near you. I don't think that's going to be a challenge long term. Making LLMs faster is an easier problem than making them smarter.
Putting a coffee cup in a coffee machine (one of the demo videos) and then learning to fix it when it misaligns seems like an impressive capability. It compensates for precision and speed with adaptability and reasoning: analyze the camera input, correctly assess the situation, the problem, and the challenge; come up with a plan to perform the task; execute the plan; re-evaluate; adapt; fix. It's a bit clumsy, but the end result is coffee. Good demo, and I can see how you might make it do all sorts of things that are vaguely useful that way.
The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine and identifying how those things fit together and in what context that is required are all things that LLMs can do.
Better feedback loops and hardware will make this faster and less tedious to watch. Faster LLMs will help with that too. And better LLMs will result in fewer mistakes, better plans, etc. It seems both capabilities are improving at an enormously fast pace right now.
And a finer point about human intelligence is that we divide and conquer. Juggling is a lot harder when you start thinking about it. The thinking parts of your brain interfere with the lower-level neural circuits involved in juggling. You'll drop the balls. The whole point of juggling is that you need to act faster than you can think. Like LLMs, we're too slow. But we can still learn to juggle. Juggling robots are going to be a thing.
>The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine and identifying how those things fit together and in what context that is required are all things that LLMs can do.
I'm skeptical that any LLM "knows" any such thing. It's a Chinese Room. It's got a probability map that connects the lexemes (to us) 'coffee machine' and 'coffee cup' depending on other inputs that we do not and cannot access, and spits out sentences or images that (often) look right, but that does not equate to any understanding of what it is doing.
As I was writing this, I took chat GPT-4 for a spin. When I ask it about an obscure but once-popular fantasy character from the 70s cold, it admits it doesn't know. But, if I ask it about that same character after first asking about some obscure fantasy RPG characters, it cheerfully confabulates an authoritative and wrong answer. As always, if it does this on topics where I am a domain expert, I consider it absolutely untrustworthy for any topics on which I am not a domain expert. That anyone treats it otherwise seems like a baffling new form of Gell-Mann amnesia.
And for the record, when I asked ChatGPT-4, cold, "What is Gell-Mann amnesia?" it gave a multi-paragraph, broadly accurate description, with the following first paragraph:
"The Gell-Mann amnesia effect is a term coined by physicist Murray Gell-Mann. It refers to the phenomenon where people, particularly those who are knowledgeable in a specific field, read or encounter inaccurate information in the media, but then forget or dismiss it when it pertains to other topics outside their area of expertise. The term highlights the paradox where readers recognize the flaws in reporting when it’s something they are familiar with, yet trust the same source on topics outside their knowledge, even though similar inaccuracies may be present."
Those who are familiar with the term have likely already spotted the problem: "a term coined by physicist Murray Gell-Mann". The term was coined by author Michael Crichton.[1] To paraphrase H.L. Mencken, for every moderately complex question, there is an LLM answer that is clear, simple, and wrong.
1. https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
It might be nice if the author qualified "most of the freely available data on the internet" with "whether or not it was copyrighted" or something to acknowledge the widespread theft of the works of millions.
Theft is the wrong term, it implies that the original is no longer available. It's copyright infringement at best, and possibly fair use depending on jurisdiction. It wasn't theft when the RIAA went on a lawsuit spree against mp3 copying, and it isn't theft now.