• ascorbic 2 hours ago

    The o1 example is interesting. In the CoT summary it acknowledges that the most recent official information is 1611m, but it then chooses to say 1622 because it's more commonly cited. It's like it over-thinks itself into the wrong answer.

    • freehorse an hour ago

      Does it search the internet for that? I assume so, because otherwise the claim about how often something is cited doesn't make much sense, but it would be interesting to know for sure. Even gpt4o mini with Kagi gets it right with search enabled (and wrong without search enabled - I tried a few times to make sure).

      • sd9 6 minutes ago

        I don’t think the public o1 can search the internet yet, unlike 4o. In principle it could know that something is more commonly cited based on its training data. But it could also just be hallucinating.

        • asl2D 40 minutes ago

          Could the claim about citation frequency just be an answer pattern rather than the model's actual reasoning?

          • freehorse 29 minutes ago

            Yeah, there could be parts of the training set where 1611 is explicitly called the official figure and 1622 is explicitly called the most commonly cited one. But it could also have access to search results directly, I think. Is there a way to know whether it does or not?

      • 0xKelsey 2 days ago

        > The scenario that I’m worried about, and that is playing out right now, is that they get good enough that we (or our leaders) become overconfident in their abilities and start integrating them into applications that they just aren’t ready for without a proper understanding of their limitations.

        Very true.

        • Terr_ 9 minutes ago

          > Welcome to the era of generative AI, where a mountain can have multiple heights, but also only one height, and the balance of my bank account gets to determine which one that is. All invisible to the end user and then rationalised away as a coincidence.

          I've always found the idea of untraceable, unfixable, unpredictable bugs in software... Offensive. Dirty. Unprofessional.

          So it's been disconcerting to watch a non-trivial portion of "people like me" appear willing to overlook such things in LLMs, while integrating them into workflows where bad output cannot be detected.