• aithrowawaycomm 5 hours ago

    I didn't think this GitHub Pages write-up was very clear, but the linked paper on arXiv is interesting (haven't finished reading yet!) and this is a cool project.

    Ultimately the weaknesses seem to come from "outsourcing" true spatio-mechanical reasoning to a language model, which designs the corresponding constraints but does so with the same kind of brittle reasoning and odd limitations we've come to expect. It's not really "artificial" spatial reasoning so much as "virtual": sometimes quite good, but paper-thin and largely based on memorizing patterns. I think the authors overstated a few conclusions; e.g. the clothes folding doesn't appear to follow any strategy at all, let alone a "novel" one. Whatever apparent hints of strategy the authors are seeing are probably better explained by the symmetry of human clothing, which the vision model picks up on.

    And note they didn't ask the robot to fold messy clothing like a human does when it's fresh out of the dryer; I suspect the robot needs shirts and pants to be laid out neatly, otherwise the vision model will misidentify them.

    More generally, the authors did not do enough to stress-test the robot in situations that don't line up with the training data. It's cool to pour tea from a pot into a mug, but the vision model has presumably seen thousands of photos of this for the robot to imitate. What if you ask the robot to pour a mug back into an open teapot? Presumably the vision model is less adept with this prompt; maybe the robot will still manage, since it's a simple task.

    But experience with ANNs suggests it's likely to falter in these off-the-golden-path cases, and that it'll falter in ways that are bizarre and unpredictable. I would have liked to see more comprehensive stress testing before using fancy terms like "spatio-temporal reasoning." AI does not need more fancy tech demos driving unrealistic hype.

    Regardless, the results are very cool, and the underlying machinery is sophisticated without being too mysterious (once you accept the mysterious AI models it's built on...). I think the edge-case issue might rule out industrial deployment in e.g. a factory, but robotics tinkerers and hobbyists would have a blast with these ideas, and people much cleverer than me could even make a real product.

    • leetrout 2 hours ago

      I work tangentially to robots that make use of computer vision... we're standing on the shoulders of many giants with OpenCV and ROS (for all their warts).

      Being able to get reliable object detection without custom training weights seems like the next domino to fall and become ubiquitous. Based on playing around with A1111 and similar tools, I suspect we're headed to a place where small, reliable models will work well enough to accomplish your task of pouring the tea back into the pot.
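
      For concreteness, here's roughly what "no custom training weights" looks like today: a zero-shot detector you just hand label strings to at inference time. A minimal sketch, assuming the Hugging Face transformers pipeline with OWL-ViT (the model choice, image path, and labels here are mine, not anything from the article):

        # Zero-shot object detection: no task-specific training, just
        # candidate labels at inference time. Model/labels are illustrative.
        from PIL import Image
        from transformers import pipeline

        detector = pipeline("zero-shot-object-detection",
                            model="google/owlvit-base-patch32")

        image = Image.open("scene.jpg")  # hypothetical camera frame
        for d in detector(image, candidate_labels=["teapot", "mug"]):
            # Each hit carries a label, a confidence score, and a pixel box.
            print(d["label"], round(d["score"], 2), d["box"])

      Not fast or small yet, but the ergonomics are already there.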

      I also find the procedural animation demos with UE 5[0] interesting, since what I see first hand is a lot of "key frame" robotics programming for complex movements... combining all these concepts will lead to some very unique solutions with a lot less hand-holding. Wonder how fast we'll get there...
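
      By "key frame" programming I mean roughly this: a few hand-authored joint poses with interpolation in between. A toy sketch; the joint values, timings, and send_joint_command call are all hypothetical:

        # Keyframed joint-space motion: hand-picked poses, spline in between.
        import numpy as np
        from scipy.interpolate import CubicSpline

        times = np.array([0.0, 1.0, 2.5, 4.0])    # seconds
        keyframes = np.array([                     # radians, 3-joint arm
            [0.0,  0.0, 0.0],    # home
            [0.8, -0.4, 0.2],    # reach toward pot
            [0.8, -0.4, 1.3],    # tilt wrist to pour
            [0.0,  0.0, 0.0],    # return home
        ])

        # Cubic splines give continuous velocity through each keyframe.
        spline = CubicSpline(times, keyframes, axis=0)

        for t in np.arange(0.0, times[-1], 0.01):  # ~100 Hz controller rate
            q = spline(t)                          # interpolated joint angles
            # send_joint_command(q)                # hypothetical robot API

      Every new behavior means authoring new keyframes by hand, which is exactly the hand-holding these learned approaches could cut down on.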

      0: https://www.linkedin.com/feed/update/urn:li:activity:7235116...