Update on Reflection-70B (glaive.ai)
Submitted by mellosouls 5 hours ago
  • ipsum2 4 hours ago

    Sahil Chaudhary of GlaiveAI perpetrated fraud: he swapped out the model he supposedly "trained" for other backend ML providers. He still has not given a reason why the string "Claude" would be missing from the output (it just "magically happened"), despite the base model, Llama 3.1 70B, having no issue producing the text "Claude", and despite the dataset not missing the string "Claude" either!

    Note that there was additional evidence beyond the missing string "Claude": matching the maximum number of tokens the model was able to produce. This gets more technical, but ChatGPT, Claude, and Llama all use different tokenizers, so the same words get broken up into different pieces. The API consistently did NOT match the base model's tokenizer (Llama's); instead, it produced the same number of tokens as Claude.
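
    A rough sketch of that kind of check, assuming the tiktoken and Hugging Face transformers packages and using GPT's cl100k_base encoding as a stand-in (Claude's tokenizer isn't distributed as a public library); this is not the actual harness people used:

    ```python
    # Compare how different tokenizers split the same text. The counts differ,
    # which is what makes token counts usable as a backend "fingerprint".
    import tiktoken
    from transformers import AutoTokenizer

    text = "Reflection is a fine-tune of Llama 3.1 70B with reflection tags."

    gpt_enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4-era encoding
    llama_tok = AutoTokenizer.from_pretrained(
        "meta-llama/Meta-Llama-3.1-70B"  # gated repo; requires HF access
    )

    print("GPT-style tokens:", len(gpt_enc.encode(text)))
    print("Llama tokens:    ", len(llama_tok.encode(text, add_special_tokens=False)))
    # An API that claims to serve a Llama fine-tune but consistently truncates
    # and counts tokens like a different tokenizer is a red flag.
    ```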

    Companies and individuals should probably avoid GlaiveAI and Matt Shumer lest they get scammed too.

    • lolinder 3 hours ago

      Sorry, I'm having trouble finding more information about this—what is the significance of the model being unable to produce the string "Claude"? Was this some sort of half-hearted censorship to prevent it from disclosing its name? Where can I read more?

    • nisten 4 hours ago

      they're 140GB folders for each checkpoint; yes, file corruption happens
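
      (for context on where 140GB comes from, a back-of-the-envelope check; the 2-bytes-per-parameter figure assumes bf16/fp16 weights)

      ```python
      # 70B parameters stored in bf16/fp16 take 2 bytes each,
      # so a full checkpoint is roughly 140 GB of weights alone.
      params = 70e9
      bytes_per_param = 2
      print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 140 GB
      ```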

      and as for the fraud part... it was an open-source model release that did not meet the claimed benchmarks when people tried to replicate it

      • bastawhiz 3 hours ago

        The fraud part was multiple independent sources producing fairly indisputable evidence that their "hosted" version of the model was just running GPT and Claude. That alone is enough to completely discredit absolutely everything about this work.

        As for corruption, I don't believe the excuse "yes file corruption happens". They're model weights. If this was actually trained, it was done on serious hardware with error-corrected storage. They weren't storing the checkpoints on microSD cards. It's certainly possible there was some really unfortunate luck and corruption occurred, but I don't find that excuse plausible. Especially when this is your business (and your launch!)

        • ipsum2 3 hours ago

          Definition of fraud, from Google:

          * wrongful or criminal deception intended to result in financial or personal gain.

          * a person or thing intended to deceive others, typically by unjustifiably claiming or being credited with accomplishments or qualities.

          Since they were advertising GlaiveAI as this magical source of data, with which they trained a model that performed better than Claude and ChatGPT, I think this firmly falls into that camp! Your definitions may be different from mine.

          • nisten 40 minutes ago

            it was a free open-source model release, the API was not for sale, and there are literally over a million FREE models on Hugging Face.

            • alsodumb 7 minutes ago

              Who cares if the model was free? No one said they were trying to commit fraud by releasing that model; they were trying to commit fraud by subtly advertising that their companies/products had the secret sauce to make state-of-the-art models, which they obviously didn't have.

          • alsodumb 3 hours ago

            Are you telling me someone trained a huge model, and served it for hours to tons of users, and had only one instance of the checkpoint? I call BS.

            The model being open-source doesn't mean that what they tried to get away with (or could have) isn't fraud.

            • bhouston 2 hours ago

              He served tons of people from his personal laptop? How is that possible? A 70B LLM is pretty taxing to serve even for a single user, let alone the crush of users that tried out this new hyped model, no? What am I missing?

        • coolspot 4 hours ago

          On one hand, I want to believe Sahil, on the other hand most of his explanations don’t make much sense:

          * He can't upload the exact weights he had on his computer. The guy runs an AI hosting/inference/training company, and he can't upload weights he has!

          * The original benchmark harness wasn't shared, but it had a bug that conveniently boosted the model's results.

          * The API somehow mysteriously censors the model name, and the tokenizer is an exact match for Claude's.

          • ameliaquining an hour ago

            He seems to be claiming that anyone can now reproduce the weird Claude censorship locally with the uploaded weights. Has anyone checked whether that's true or not, or is he mischaracterizing the allegations?

            • xena an hour ago

              I'm going to be downloading the weights and doing local verification

              • BoorishBears an hour ago

                I think the most damning thing about this whole saga for all of AI is how much energy and attention people are giving it.

                In most established verticals, such a cartoonish scam would be dead on arrival. But apparently generative AI is still not mature enough to just move past this kind of garbage in a clean break.

                • xena 35 minutes ago

                  To be fair, the AI industry is used to people manifesting out of nowhere, doing something stupid, and ending up with revolutionary results. It's no surprise that there's a default optimism (especially since, if it pans out, it makes running high-quality AI stuff so much cheaper).

                  • refulgentis 32 minutes ago

                    It's not a cartoonish scam, and if it was, it took 48 hours to fall apart. Not worth getting the Jump to Conclusions™ mat out for.

                    This isn't said aggressively or to label, but rather to provide some context that it's probably not nearly as simple as you're suggesting: this thread looks like a bunch of confused engineers linking drama threads from laymen on Twitter/Reddit to each other, seeing pitchforks, and getting out their own. Meanwhile, the harsh conclusions they jump to are belied by A) having engineering knowledge _and_ looking into their claims, and B) reading TFA.

              • all2 4 hours ago

                I've seen stuff like this hacked together. If he isn't very organized or was hasty, there's a good bet he deleted the working weights or doesn't know which of the 5 or 10 sets of weights is the right one.

                Nothing would stop him from uploading all the weights, I suppose...

                • ipsum2 4 hours ago

                  No. He served the "weights" (actually Claude) for over 24 hours. It's practically impossible to have served the "correct weights" and just have lost them.

                  • Havoc 3 hours ago

                    >It's practically impossible to have served the "correct weights" and just have lost them.

                    Deleting files is very much a thing

                    • minimaxir 3 hours ago

                      The AI dog ate his homework?

              • nisten 32 minutes ago

                Has anyone here actually run the code on their own hardware yet?

                I did a standard non-middleware lm_eval_harness run and got 0.3214 on gpqa_main_zeroshot WITH the system prompt and 0.3616 without it.

                I haven't run it yet with the middleware that's supposed to do the subtraction. Now, if that adds 20% to the score, that would be a huge deal, but it would also roughly match the jump from GPT-4o to o1-preview that they got on gpqa_diamond.
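
                If anyone wants to reproduce that kind of number, here is a sketch of a plain (no middleware) run using EleutherAI's lm-evaluation-harness; the local weights path is a placeholder, and GPQA itself is a gated dataset on Hugging Face:

                ```python
                # Plain lm-evaluation-harness run, no reflection middleware.
                # Assumes `pip install lm_eval` and local access to the uploaded weights.
                import lm_eval

                results = lm_eval.simple_evaluate(
                    model="hf",
                    model_args="pretrained=/path/to/reflection-70b,dtype=bfloat16",  # placeholder path
                    tasks=["gpqa_main_zeroshot"],
                )
                print(results["results"]["gpqa_main_zeroshot"])
                ```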

                • thorum 4 hours ago

                  > This along with a few tokenizer related tests people ran, made people suspect that we are just serving Claude with post-processing where we filter out words like Claude.

                  Didn't these "few tokenizer related tests" prove the API was using Claude's tokenizer instead of Llama's, based on how words were being divided into tokens?

                  That's a hard one to explain (it doesn't appear they're even trying to).

                  • refulgentis 30 minutes ago

                    People keep asserting that, but really it was just people setting max tokens to a certain value and counting how many words came out. They didn't actually have the tokens. It's perfectly possible to get collisions, and I'd wager it's even likely in the scenarios they tested: a simple question, under 10 tokens, in English.
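
                    For what it's worth, a sketch of what a more careful version of that probe would look like (endpoint URL, API key, and model name are placeholders; this assumes an OpenAI-compatible API and publicly available tokenizers):

                    ```python
                    # Ask the suspect endpoint for a completion capped at max_tokens, then
                    # re-tokenize the returned text with each candidate tokenizer. The one
                    # whose count matches the cap is the likely backend. Short English answers
                    # can collide across tokenizers, so you'd want many prompts and larger caps.
                    import tiktoken
                    from openai import OpenAI
                    from transformers import AutoTokenizer

                    client = OpenAI(base_url="https://example.invalid/v1", api_key="...")  # placeholders
                    resp = client.chat.completions.create(
                        model="reflection-70b",  # whatever the endpoint calls itself
                        messages=[{"role": "user", "content": "Explain photosynthesis."}],
                        max_tokens=40,
                    )
                    text = resp.choices[0].message.content

                    llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
                    gpt = tiktoken.get_encoding("cl100k_base")
                    print("Llama count:", len(llama.encode(text, add_special_tokens=False)))
                    print("GPT count:  ", len(gpt.encode(text)))
                    ```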

                  • kristianp 5 hours ago

                      If this is for real, in some ways it shows how small OpenAI's moat is. Once someone knows something is possible and has the rough idea, the community can replicate it in 4 weeks.

                    • jsheard 4 hours ago

                      Isn't Reflection supposed to be based on CoT like o1? It was originally released a week before o1 was, so if it was the real deal all along then OpenAI were outright beaten to the punch rather than replicated after the fact.

                      • ipsum2 4 hours ago

                          No. CoT has been around for years (Jan 2022) https://arxiv.org/abs/2201.11903, and so has Reflection (March 2023) https://arxiv.org/abs/2303.11366. The approach taken by Reflection is nothing new.

                        • thorum 4 hours ago

                          CoT moderately improves model performance, but all non-o1 models suck at actually thinking step by step effectively. If the task is not straightforward, they make obvious mistakes or default to guessing.

                          OpenAI trained o1 to pick better steps in its chain of thought. (The moat is the dataset they used to do that.)

                          • bastawhiz 3 hours ago

                            Maybe, if it wasn't an outright fraud. Arguably they didn't beat anyone to anything.

                            • refulgentis 28 minutes ago

                              > Maybe, if it wasn't an outright fraud.

                              I mean, it obviously wasn't, did you read the thing we're commenting on? n.b. At this point, you have all you need to replicate it. Far shy of "outright fraud", though, I'm sure there's a bailey for that motte.

                            • nisten 4 hours ago

                              yes and the massive increase to GPQA scores from o1 was attributed to this technique, so there is something there, despite the hard feelings of unproductive reddit users

                          • Havoc 3 hours ago

                            An expensive lesson in how fragile reputations can be

                            • bhouston 2 hours ago

                              I am confused. He was hosting the 70B LLM everyone was demoing from his laptop? How can that serve the load? When I’ve run LLMs locally it is really taxing for just one concurrent session.

                              • alsodumb 3 hours ago

                                  I don't trust Sahil and Matt. They tried to commit fraud and hype things up, but it got far more attention than they expected, so they tried to get away with just serving Claude/ChatGPT in the background, and got caught. They are nothing but grifters who got caught and are now trying to repair their image.

                                • m3kw9 4 hours ago

                                    It's either fraud or incompetence; I say it's just incompetence, like he said. Maybe they got too excited over some inflated test results because some validation data got mixed in.

                                  • bastawhiz 3 hours ago

                                    So when they put the hosted model online (which was actually just proxying Claude), they explicitly prompted Claude to censor its own name. That's not explainable with incompetence. It's very intentional deception.

                                    • jazzyjackson 2 hours ago

                                      I just can't facepalm enough seeing so-called AI companies relying on Python's .replace() when they need to hide what service they're building on.
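
                                      Purely as illustration (nobody has published the actual serving code), the kind of post-processing filter being alleged is just a few lines:

                                      ```python
                                      # Hypothetical "hide the upstream provider" filter.
                                      # Trivially defeated (spell it out, ask in another
                                      # language, ask for base64), which is why a missing
                                      # word reads as a tell rather than a coincidence.
                                      BANNED = ["Claude", "Anthropic"]

                                      def scrub(completion: str) -> str:
                                          for word in BANNED:
                                              completion = completion.replace(word, "")
                                          return completion

                                      print(scrub("I am Claude, made by Anthropic."))
                                      # -> "I am , made by ."
                                      ```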

                                  • blackeyeblitzar 4 hours ago

                                    Past discussions about this model and its controversies:

                                    Reflection 70B, the top open-source model https://news.ycombinator.com/item?id=41459781

                                    Confirmed: Reflection 70B's official API is a wrapper for Sonnet 3.5 https://news.ycombinator.com/item?id=41484981

                                    • nisten 4 hours ago

                                      There is no official API; you're "confirming" a temporary one that was taken down weeks ago. That was served via OpenRouter, a routing API service/site, which routes to different models under load.

                                      Yes, they could've switched it themselves too.

                                    • ilaksh 4 hours ago

                                      The models are very powerful. This can help anyone, including scammers. The number of scams will be enormous.

                                      I just hope people are also able to see how useful it is for non-scumbags and don't let the scammers ruin it for everyone else. I am not going to mention another similar category of technology in this regard, just to stay "politically correct" for this site.

                                      • talldayo 3 hours ago

                                        > I just hope people are also able to see how useful it is for non-scumbags and don't let the scammers ruin it for everyone else.

                                        I hope so too - it's been four years since GPT-3 came out and I haven't found a single serious time-saving application for the technology.

                                        If someone doesn't start making money with LLMs soon, then it will only be the scammers who benefit!