• ungreased0675 a day ago

    I would speculate this is true of all the leading commercial LLM models. Don’t have enough training data? Just steal some!

    • Havoc 21 hours ago

      On true for all - you’d need to split it by era I think

      During the early Llama 1 days The Pile dataset was in heavy use by many. A bit later people figured out that a subset of it - Books3 - was especially problematic.

      I’m guessing all the big houses threw that piece out in later models since it’s extra radioactive

      • archerx 17 hours ago

        What was problematic about it?

        • Havoc 14 hours ago

          Thousands of pirated copyrighted books

      • gooosle 16 hours ago

        Copy some*

        • pera 10 hours ago

          It depends who you are:

          - if you are an individual then it's called "pirating copyrighted work"

          - if you are a multi-billion dollar corporation then it's called "use of uncleared material for training"

          • throw5959 6 hours ago

            Intent matters. I'm very happy we don't live in a world where it doesn't matter.

          • BSDobelix 14 hours ago

            That's exactly the difference, one does not steal in the digital world. If I could download/copy a car I would do it ;)

            • dehrmann 16 hours ago

              Courts have yet to decide on which it is, and it might depend on how well the model can transform vs. recite.

              • vidarh 12 hours ago

                The point is that whatever courts decide, it is not theft. It may or may not be copyright infringement, but copyright infringement is not theft.

                • exe34 11 hours ago

                  but muh shareholders!

            • bodiekane 9 hours ago

              If we're going to use absurd hyperbole like "steal", I think we should just keep going further.

              Zuckerberg murdered some old library books to train a model. Zuckerberg genocided training data!

              Heck, everyone who read your comment here stole it. I'm so sorry for your loss.

            • dekhn 21 hours ago

              I believe it was already known that anything trained on The Pile contained references to copyrighted material from Sci-Hub. It seems unlikely that folks who chose to use these sources were completely unaware of the nature of the data. Presumably, given the urgency in the last 2-3 years to be a leader in this space, a number of shortcuts were taken.

              • nobrains 15 hours ago

                Zuck did a calculation: "Does the risk of lawsuits and bad PR outweigh the benefits of being early?".

                If you remove morals from the equation, nearly every CEO would have made that same decision if in that position.

                • throw5959 14 hours ago

                  You talk about morals, but did you consider that they are releasing the model as open source, and given that OpenAI and others don't do the same, Zuck is really the only current option to have a reasonably comparable open source model? Also, did you consider that it might be more moral to create an AI model than to uphold copyright law, which actually many on this site deem immoral?

                  IMHO this is a moral win on Zuck's side.

                • Havoc 21 hours ago

                  Stripping out the copyrights is quite damning.

                  There is wrongdoing and there is obvious evidence that you know what you’re doing is wrong. That really limits options on defence.

                  • palata 8 hours ago

                    Same old story: Meta is too big to care. What will happen? A fine? Sure, they can pay.

                    • vivzkestrel 19 hours ago

                      Stupid question: I have 400,000 ebooks (yup, pirated ones). What happens if I build an LLM with these?

                      • rcakebread 18 hours ago

                        You'd still ask stupid questions?

                        • blitzar 14 hours ago

                          You would have a net worth of 1bn

                          • fooker 12 hours ago

                            Depends on the parameter count.

                            Too high? Straight to jail.

                            Too low? Believe it or not, straight to jail.

                              • covofeee 13 hours ago

                                You also need $100m to train it

                                • wil421 12 hours ago

                                  Build Chappie.

                                  • anothername12 15 hours ago

                                    You’ll be fine. It’s like laundering money.

                                    • stuckkeys 19 hours ago

                                      Nothing.

                                      • gooosle 16 hours ago

                                        You go to jail forever.

                                        • solumunus 19 hours ago

                                          What do you imagine could happen?

                                        • asdefghyk a day ago

                                          It's also reported elsewhere (in media articles linked to by Hacker News) that they torrented copyrighted material. AMAZING

                                          • rurban 15 hours ago

                                            Jail time? Or just multi-million fines.

                                            Will he be allowed to lead Meta if convicted as a criminal?

                                            • covofeee 13 hours ago

                                              You saw the WP cartoon right?

                                              • exe34 11 hours ago

                                                he should run for election!

                                                • alightsoul 11 hours ago

                                                  Given how things are going, maybe it will be ruled as "fair use" whereas something like controlled digital lending at the Internet Archive was ruled as "infringing". Disgusting. So AI might become the only "legal" way to access a lot of knowledge for free that you otherwise wouldn't have access to.

                                                  • htrp 13 hours ago

                                                    I was under the impression that almost everyone trained on Books3

                                                    • cma 13 hours ago

                                                      A book's copyright is no more valid than a website's

                                                      • musicale 19 hours ago

                                                        "I'm shocked, shocked to find out that piracy is going on here!"

                                                        "Your LLM, Captain Zuckerberg."

                                                        "Oh, thank you very much!"

                                                        • udev4096 17 hours ago

                                                          Everyone knows that LLMs are trained on a shit ton of pirated content

                                                          • atulvi 20 hours ago

                                                            Good. These laws are anti progress.

                                                            • covofeee 13 hours ago

                                                              Copyright is your friend.

                                                              • horsawlarway 12 hours ago

                                                                No.

                                                                There is a theoretical implementation of copyright that is your friend.

                                                                The realities of the laws as implemented today are abusive and hostile.

                                                                • palata 8 hours ago

                                                                  Does it mean that they should be removed entirely? Surely we can agree on the fact that I should not be allowed to make a copy of a book, put my name on it instead of the real author, and sell it? Or even claim that I wrote it and put it on my resume?

                                                                • bodiekane 9 hours ago

                                                                  Copyright is the friend of the 1% and the enemy of everyone else.

                                                                  (Of course, I'm using "the 1%" rhetorically, it's really more like 0.01%)

                                                                  As a society, we all clearly benefit from fair use far more than we benefit from members of the copyright cartel buying another mansion or private jet.

                                                                • ulfw 20 hours ago

                                                                  What "progress"?

                                                                  • pizza 20 hours ago

                                                                    Exfiltration of information from the economy

                                                                    • exe34 11 hours ago

                                                                      does the economy lose this information? are pages now missing from the books on your bookshelf?

                                                                  • idiotsecant 18 hours ago

                                                                    We're literally extracting, refining, and re-using the information, art, and thoughts of fellow humans to make billionaires money.

                                                                    This isn't the 90s. Computing isn't about discovery, not in the big leagues. It's about grinding up authenticity and feeding it into a machine to convert it into shareholder value.

                                                                    If they want the value, let them pay for it or release the models open source for all to benefit.

                                                                    • archerx 17 hours ago

                                                                      They have released all the models for free so far, unlike other companies like OpenAI, who are most likely doing the same but keeping it private and proprietary.