• exq a day ago

    So it's okay when big American corps raid the internet ignoring any terms of service or licenses they see in order to train models they rent back to us, but when a foreign entity trains off of Anthropic it's illegal?

    • riku_iki a day ago

      From the tweet, Anthropic's point is that distillation is OK, unless the new model has its safeguards removed or is used for military or surveillance purposes.

      • dmonitor a day ago

        The fact that they're calling it an "attack" implies otherwise.

        I find the entire premise of this announcement absurd. Fraudulent accounts? They're just accounts. They paid for the access the same as any other. They're accessing Claude just like a human (or *claw) would.

        There's no argument against their strategy that doesn't make them complete hypocrites with respect to how they got the model training data in the first place.

        • mongrelion 17 hours ago

          I agree with you, especially with this:

          > They paid for the access the same as any other.

          If anything, this makes them more legit than Anthropic because they are paying for the content, whereas Anthropic just stole *all* the data they got a hold of. So, in this case the Chinese AI labs stand on higher moral ground LOL.

          • riku_iki a day ago

            > them complete hypocrites with respect to how they got the model training data in the first place.

            Sure, hypocrisy is part of the rules of big games: politics and business.

            > Fraudulent accounts? They're just accounts.

            They tell the story in the blog post: they don't allow Claude in China, but those labs used proxy services to access Claude and mixed their traffic in with regular users' to hide the activity.

          • _aavaa_ a day ago

            I don’t think so. It reads much more like “distillation is okay when you do it to your own models.”

        • credit_guy a day ago

          I don't think this counts as distillation. Distillation is when you use a teacher model to train a student model, but, crucially, you have access to the entire probability distribution over the generated tokens, not just to the tokens themselves. That probability distribution tremendously increases the strength of the training signal, so training converges much faster. Claude does not provide these probabilities. So Claude was used for synthetic training data generation, but not really for distillation.
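
          For the curious, roughly what the soft-target loss looks like (a minimal PyTorch sketch; the temperature and weighting here are illustrative, not anything the Claude API exposes):

              import torch.nn.functional as F

              def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
                  # Soft targets: KL divergence against the teacher's full
                  # (temperature-softened) distribution -- the strong signal.
                  soft = F.kl_div(
                      F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean",
                  ) * (T * T)
                  # Hard targets: plain cross-entropy on the sampled tokens --
                  # all you can build when an API returns text only.
                  hard = F.cross_entropy(student_logits, hard_labels)
                  return alpha * soft + (1 - alpha) * hard

          Without teacher_logits you only ever have the second term, i.e. ordinary supervised fine-tuning on synthetic data.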

          • hooloovoo_zoo a day ago

            Sampling repeatedly gives them an estimate of the probability distribution in any case though.

            • hooloovoo_zoo a day ago

              That would be an interesting paper, actually: what is the optimal sampling technique given that you only have access to the token outputs? Surely someone has already done it.
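
              The naive baseline is just Monte Carlo: resample the next token many times and count (sketch below; sample_next_token stands in for whatever temperature-1 completion call you have, purely hypothetical):

                  from collections import Counter

                  def estimate_next_token_dist(sample_next_token, prompt, n=10_000):
                      # Draw n independent next-token samples at temperature 1
                      # and turn the frequencies into an empirical distribution.
                      counts = Counter(sample_next_token(prompt) for _ in range(n))
                      return {tok: c / n for tok, c in counts.items()}

              The catch is variance: with a ~100k-token vocabulary you need an enormous n before the tail probabilities mean anything, which is presumably what an "optimal sampling" paper would attack.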

          • m4rtink a day ago

            Oh no! They are stealing all the data we have stolen ourselves! This needs to be stopped and punished immediately!

            • veunes 13 hours ago

              If just 16 million examples were enough to significantly boost model quality (as Anthropic claims), then data quality really does beat quantity.

              Instead of vacuuming petabytes of trash from Common Crawl, you can just take high-quality distillate from a SOTA model and get comparable results. Bad news for anyone betting solely on massive compute clusters and closed datasets.

              • kingstnap a day ago

                Cry me a river, build a bridge, and get over it?

                They publish weights and useful research for everyone to benefit.

                I mean this is incredibly tone deaf for a company facing multiple lawsuits over where they got their training data from.

                • ChrisArchitect a day ago

                  • SilverElfin a day ago

                    One difference between Anthropic and others is that Anthropic is crawling publicly visible information, and their argument is that this is fair use. Whereas these Chinese LLMs are circumventing an account creation process and terms of service to misuse non-public information.

                    Lots of people think Anthropic training their own LLM is the same but it really isn’t.

                    • saberience a day ago

                      Pot, meet kettle!

                      I don’t think I’m the only one feeling some schadenfreude at this news. I suppose it’s ok when you’re a hot Silicon Valley scale-up to slurp up the rest of the world’s data for free and then hire hotshot lawyers to defend you against all the creatives you ripped off, but when it’s the “evil” Chinese doing the same to you it’s a dastardly “attack”?

                      • m4rtink a day ago

                        Yeah - not only have we seen some of the same large companies that trampled regular people and made examples of them in the name of defending copyright fully ignore it when it came time to feed their AI models.

                        And now the hypocrisy has come full circle, with complaints that others aren't respecting their rights!