• rfw300 2 hours ago

    There's a methodological flaw here—you asked the model to generate an example, and then simulate the results on that example. For all you know, that specific (very simple) example and its output are available on the internet somewhere. Try it with five strings of other random words and see how it does.
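
    (Assuming the prompt in question is sklearn's TfidfVectorizer, as mentioned downthread, generating fresh ground truth to compare against is only a few lines; a minimal sketch, with made-up documents:)

      from sklearn.feature_extraction.text import TfidfVectorizer

      # Five strings of essentially random words, unlikely to appear
      # verbatim in any training data.
      docs = [
          "quartz lantern omitted basalt",
          "lantern gravel quartz quartz",
          "omitted thistle basalt pylon",
          "pylon lantern thistle gravel",
          "basalt quartz pylon omitted",
      ]

      vec = TfidfVectorizer()
      X = vec.fit_transform(docs)

      # Ground-truth matrix to compare against the model's "simulated" output.
      print(vec.get_feature_names_out())
      print(X.toarray().round(4))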

    • jononor 40 minutes ago

      This is key. LLMs will cheat and just look up training data whenever they can. That needs to be actively countered if one is to make them "reason".

      • jlmorton 42 minutes ago

        I did exactly that, and it was very close to the actual sklearn output.

      • jaccola 2 hours ago

        When I ran these prompts, I saw in the chain of thought

          Hmm, I need to run some code. I'm thinking I can use Python, right? There’s this Python tool I can simulate in my environment since I can’t actually execute the code. I’ll run a TfidfVectorizer code snippet to compute some outcomes.
        
        It is ambiguous, but this leads me to believe the model does have access to a Python tool. Also, my 'toy examples' were identical to yours, which makes me think they were seen in the training data.

        This gave me a thought on the future of consumer-facing LLMs though. I was speaking to my nephew about his iPhone; he hadn't really considered that it was "just" a battery, a screen, some chips, a motor, etc., all in a nice casing. To him, it was a magic phone!

        Technical users will understand LLMs are "just" next token predictors that can output structured content to interface with tools all wrapped in a nice UI. To most people they will become magic. (I already watched a video where someone tried to tell the LLM to "forget" some info...)

        • hhh 28 minutes ago

          If you're using ChatGPT or the Assistants API w/ managed tools (I don't remember if this is even available for o3-mini), it has access to a Python execution tool.

          • CamperBob2 an hour ago

            85 IQ: LLMs are magic

            110 IQ: LLMs are "just" next token predictors that can output structured content to interface with tools all wrapped in a nice UI

            140 IQ: LLMs are magic

            • kazinator 34 minutes ago

              140 IQ: LLMs are magic ... token predictors that can output structured content

          • lblume 4 hours ago

            Well, the code in question is also written by the same LLM, so it could just output something it knows the answers to already. On its own, this result doesn't really seem to prove anything.

            • jlmorton 2 hours ago

              I tried with alternate values and got essentially the same result: not exactly identical, but extremely close values.

            • cfcf14 an hour ago

              The obvious next step here is to see how well this generalises to arbitrary inputs :)

              • emsi 3 days ago

                OK, this is wild. I just saw o3-mini (regular) precisely simulate (calculate?) the output of quite complicated computations. Well, at least complicated for a human… and no, it didn’t use the code interpreter.

                • pkaye an hour ago

                  I was trying to solve this simple beam deflection problem and have been getting inconsistent results between different runs in various models (o1-mini and Gemini 2.0 Flash Thinking Experimental). Do you get consistent deflection numbers?

                  > A 6061-T6 aluminum alloy hollow round beam, 2 in in diameter with 0.125 in wall thickness and 120 in long, is simply supported at each end. A point load of 100 lb is applied at the middle. What is the deflection at the middle and 12 in from the ends?
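
                  For reference, the closed-form answer from the standard simply supported beam formulas is easy to compute directly; a minimal sketch, assuming E of about 10.0e6 psi for 6061-T6 (the prompt doesn't give a modulus) and treating 2 in as the outer diameter:

                    import math

                    # Simply supported beam, point load P at midspan (standard beam-table formulas).
                    E = 10.0e6           # psi, assumed elastic modulus for 6061-T6
                    D, t = 2.0, 0.125    # in, diameter (assumed outer) and wall thickness
                    d = D - 2 * t        # in, inner diameter
                    L, P = 120.0, 100.0  # in, lb

                    I = math.pi / 64 * (D**4 - d**4)   # second moment of area, in^4

                    # Midspan deflection: P*L^3 / (48*E*I)
                    delta_mid = P * L**3 / (48 * E * I)

                    # Deflection at distance x from a support (valid for x <= L/2):
                    # delta(x) = P*x*(3*L^2 - 4*x^2) / (48*E*I)
                    x = 12.0
                    delta_x = P * x * (3 * L**2 - 4 * x**2) / (48 * E * I)

                    print(f"I = {I:.4f} in^4")
                    print(f"deflection at midspan     = {delta_mid:.3f} in")
                    print(f"deflection 12 in from end = {delta_x:.3f} in")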

                  • tripplyons 3 days ago

                    How do you know it didn't use a code interpreter if they don't share the chain-of-thought?

                    • politelemon 4 hours ago

                      Perhaps worth running a few more submissions to determine if it did use one or not.