• elcomet 2 hours ago

    It's a similar approach to OpenAI's o1 model (it's not cited, but there's no available paper for o1).

    I don't see any mention of weight release unfortunately.

    • diggan 33 minutes ago

      I think the submitted paper is talking about reinforcement learning as part of/after the main training, after which the model does inference as normal.

      They might have done that for o1, but the bigger change is the "runtime train of thought": once the model receives the prompt and before it gives a definitive answer, it "thinks" in words and readjusts at runtime.

      At least that's my understanding from these two approaches, and if that's true, then it's not similar.

      AFAIK, OpenAI has been doing reinforcement learning since the first version of ChatGPT and for all models since; that's why you can leave feedback in the UI in the first place.

      • numeri 17 minutes ago

        OpenAI stated [1] that one of the breakthroughs needed for o1's train of thought to work was reinforcement learning to teach it to recover from faulty reasoning.

        > Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.

        That's incredibly similar to this paper, which discusses the difficulty of finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right on the very first try.

        [1]: https://openai.com/index/learning-to-reason-with-llms/

        • nsagent 21 minutes ago

          Both models generate an answer after multiple turns, where each turn has access to the outputs from a previous turn. Both refer to the chain of outputs as a trace.

          Since OpenAI did not specify what exactly is in their reasoning trace, it's not clear what, if any, difference there is between the approaches. They could be vastly different, or they could be slight variations of each other. Without details from OpenAI, it's not currently possible to tell.

        • WithinReason an hour ago

          how is it similar?

      • ziofill 26 minutes ago

        Is this effectively some sort of knowledge distillation?

        • plaguuuuuu an hour ago

          LLMs have no direct recollection of the qualia of their own training. This is at least one major way that I correct myself: if I'm about to talk about something I know, I'll try to figure out how/why I know that thing and, in so doing, try to gauge whether I actually know it, whether I'm hallucinating, or whether I actually heard it from a less than reliable source, etc.

          I don't think LLMs can self-correct without remembering their own training in some way.

          • QuadmasterXLII 41 minutes ago

            So you’re saying the solution is to prefix each training batch with a description of a sensory experience (You read the following in a Paris cafe in 1997. While you read, you have an excellent baguette and some boiled eggs, and over-roasted coffee. The woman one table over is wearing a beautiful blue hat) and then post-train the final model into recalling the setting where it read any piece of text, or failing to recall any experience when presented with text it didn’t read?

            (If someone tries this and it works, I’m quitting my PhD and going back to camp counseling)

            • wpietri 18 minutes ago

              I don't think that's what they're saying at all. They're talking not about qualia in the human sense, but specifically about "the qualia of their own training". That is, the corpus that LLMs "learn" from and the "experiences" of those texts that are generalized during the training process. Both the raw data and the memory of "learning" are discarded.

              So if one were to improve an LLM along those lines, I believe it would be something like: 1) LLM is asked a question. 2) LLM comes up with an initial response. 3) LLM retrieves the related "learning" history behind that answer and related portions of the corpus. 4) LLM compares the initial answer with the richer set of information, looking for conflicts between the initial answer and the broader set, or "learning" choices that may be false. 5) LLM generates a better answer and gives it. 6) LLM incorporates this new "learning".

              And that strikes me as a pretty reasonable long-term approach, if not one that fits within the constraints of the current gold rush.

            • williamcotton 27 minutes ago

              Unless you’re under the influence of something or having a severe mental health crisis, you are not hallucinating, you’re confabulating.

            • optimalsolver 2 hours ago

              Spoiler: You're never going to get rid of hallucinations in the autoregressive, next token prediction paradigm (aka LeCun's Law).

              The issue here is people trying to use language models as deterministic problem solvers, rather than for what they actually excel at (semi-creative text generation).

              • plewd 2 hours ago

                Is LeCun's Law even a thing? Searching for it doesn't yield many results, except for an HN comment where it has a different definition. I guess it could be from some obscure paper, but given how poorly it's documented, it seems weird to bring it up in this context.

                • YeGoblynQueenne an hour ago

                  I think the OP may be referring to this slide that Yann LeCun has presented on several occasions:

                  https://youtu.be/MiqLoAZFRSE?si=tIQ_ya2tiMCymiAh&t=901

                  To quote from the slide:

                    * Probability e that any produced token takes us outside the set of correct answers
                    * Probability that answer of length n is correct
                    * P(correct) = (1-e)^n
                    * This diverges exponentially
                    * It's not fixable (without a major redesign)
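
                  To get a feel for the numbers, here is a minimal sketch of the slide's formula. It assumes, as the slide does, that each token independently has the same error probability e; the function name p_correct and the sample values of e and n are purely illustrative.

```python
def p_correct(e: float, n: int) -> float:
    """Probability that all n tokens stay inside the set of correct answers,
    assuming each token independently errs with probability e (the slide's model)."""
    return (1.0 - e) ** n


if __name__ == "__main__":
    # Illustrative values only; real per-token error rates are not known.
    for e in (0.001, 0.01, 0.05):
        row = ", ".join(f"n={n}: {p_correct(e, n):.3f}" for n in (10, 100, 1000))
        print(f"e={e}: {row}")
```

                  Under this model, even a 1% per-token error rate leaves only about a 37% chance that a 100-token answer is entirely error-free.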

                  • sharemywin 22 minutes ago

                    Wouldn't this apply to all prediction machines that make errors?

                    Humans make bad predictions all the time but we still seem to manage to do some cool stuff here and there.

                    Part of an agent's architecture will be to minimize e and then ground the prediction loop against a reality check.

                    Making LLMs bigger gets you a lower e with scale of data and compute, but you will still need to check against reality. Test-time compute will also play a role, as it can run through multiple scenarios and "search" for an answer.
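
                    A rough sketch of why the "search plus reality check" idea helps, under strong assumptions: each sampled answer is independently correct with probability (1-e)^n, and the reality check is a perfect verifier that always recognizes a correct answer. The names p_single and p_best_of_k are hypothetical, and real verifiers are not perfect.

```python
def p_single(e: float, n: int) -> float:
    """Chance that one n-token answer is fully correct (independent per-token errors)."""
    return (1.0 - e) ** n


def p_best_of_k(e: float, n: int, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    assuming a perfect verifier picks it out (an idealization)."""
    return 1.0 - (1.0 - p_single(e, n)) ** k


if __name__ == "__main__":
    for k in (1, 5, 10, 50):
        print(f"k={k}: {p_best_of_k(0.01, 100, k):.4f}")
```

                    With e = 0.01 and n = 100, a single sample is right about 37% of the time, while ten samples with a perfect check succeed roughly 99% of the time. The gain shrinks as the verifier gets less reliable or as samples become correlated.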

                    • roboboffin 42 minutes ago

                      Is this similar to the effect that I have seen when you have two different LLMs talking to each other, where they tend to descend into nonsense? A single error in one LLM's output then pushes the other LLM out of distribution.

                      A kind of oscillatory effect, where the train of tokens moves further and further out of the distribution of correct tokens.

                      • diggan 32 minutes ago

                        > Is this similar to the effect that I have seen when you have two different LLMs talking to each other, they tend to descend into nonsense?

                        Is that really true? I'd expect that with high temperature values, but otherwise I don't see why this would happen. I've experimented with pitting the same model against itself and also different models against each other, but haven't come across that particular problem.

                        • sharemywin 21 minutes ago

                          this is like the human game of telephone.

                        • atq2119 an hour ago

                          Doesn't that argument make the fundamentally incorrect assumption that the space of output sequences has pockets where all output sequences with a certain prefix are incorrect?

                          Design your output space in such a way that every prefix has a correct completion and this simplistic argument no longer applies. Humans do this in practice by saying "hold on, I was wrong, here's what's right".

                          Of course, there's still a question of whether you can get the probability mass of correct outputs large enough.
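
                          One way to make this concrete is a toy two-state model (an assumption for illustration, not anything from the slide): at each token the generation is either "on track" or "off track", it falls off track with probability e, and it recovers with probability r. The function name p_on_track and the values of e and r are hypothetical, and only the final state is counted as correct.

```python
def p_on_track(e: float, r: float, n: int) -> float:
    """Probability of being on track after n tokens in a two-state chain.

    e: chance an on-track step goes off track.
    r: chance an off-track step recovers ("hold on, I was wrong").
    With r = 0 this reduces to the slide's (1 - e) ** n.
    """
    p = 1.0  # start on track
    for _ in range(n):
        p = p * (1.0 - e) + (1.0 - p) * r
    return p


if __name__ == "__main__":
    for r in (0.0, 0.05, 0.2):
        print(f"r={r}: {p_on_track(0.01, r, 1000):.4f}")
```

                          With any r > 0 the probability no longer decays to zero; it settles near r / (e + r), which is where the question about probability mass comes back in.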

                          • ziofill 23 minutes ago

                            Doesn’t this assume that the probability of a correct answer is iid? It can’t be that simple.

                            • littlestymaar 19 minutes ago

                              > * P(correct) = (1-e)^n * This diverges exponentially

                              I don't get it: 1-e is between 0 and 1, so (1-e)^n converges to zero. Also, a probability cannot diverge, since it's bounded by 1!

                              I think the argument is that 1 - e^n converges to 1, which is what the law is about.
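
                              For reference, a sketch of the two limits under the slide's independence assumption, with 0 < e < 1:

```latex
\[
\lim_{n\to\infty} (1-e)^n = 0,
\qquad
\lim_{n\to\infty} \bigl[\,1-(1-e)^n\,\bigr] = 1,
\qquad 0 < e < 1 .
\]
```

                              So P(correct) decays exponentially toward 0, and it is the probability of at least one error, 1 - (1-e)^n, that approaches 1. Note that 1 - e^n also tends to 1, but it is not the error probability in this model.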

                            • vjerancrnjak 2 hours ago

                              “Label bias” or “observation bias” is a phenomenon where going outside the learned path leaves little room for error correction. LeCun talks about the lack of joint learning in LLMs.

                              • mdp2021 2 hours ago

                                A reference could be this:

                                https://futurist.com/2023/02/13/metas-yann-lecun-thoughts-la...

                                (Calling it a "law" is rhetorical, but the idea is pretty clear.)

                              • og_kalu an hour ago

                                If you're talking about label bias then you don't need to solve label bias to 'solve' hallucinations when the model has already learnt internally when it's bullshitting or going off the rails.