• swyx 4 hours ago

    D.O.A. without adoption from the major model labs (including the "opener" ones like AI2 and, let's say, Together/Eleuther). I don't like the open source old guard feeling like they have any say in defining things when they don't have skin in the game. (And yes, this is coming from a fan of their current work defending the "open source" term in traditional dev tools.) A good way to ensure decline into irrelevance is to do a lot of busywork without ensuring a credible quorum of the major players at the table.

    Please don't let me discourage you, though. I think this could be important work, but if and only if it gets endorsement from >1 large model lab producing any interesting work.

    • sigh_again 3 hours ago

      > they have any say in defining things when they dont have skin in the game.

      Then, maybe don't go around stealing and bastardizing the "open source" concept when absolutely none of the serious AI research is open source or reproducible. Just because you read a fancy word online once and think you can use it doesn't mean you're right.

      • jszymborski 3 hours ago

        > D.O.A without adoption from the major model labs

        I definitely disagree. Adoption of open licenses has historically been "bottom-up", starting with academia and hobbyists and then eventually used by big names. I have zero idea why that can't be the case here.

        I know I'll be releasing my models under an open license once finalized.

        • blackeyeblitzar 3 hours ago

          Why should the “old guard” not have a say when they came up with the idea of open source in the first place? It is misleading to adopt terminology with a well-known definition and abuse it. Companies like Meta are free to use some other term that isn’t “open source” to describe their models, which I cannot reproduce because they’ve released nothing except weights and inference code.

        • wmf an hour ago

          Various organizations are willing to release open weights but not weights that are open source under this definition, so this is going to be a no-op. Open source already existed before the OSI codified it, but now they're trying to will open source AI into existence against a ton of incentives pointing the other way.

          • godelski 3 hours ago

            I don't think this makes sense, nor is it internally consistent, let alone consistent with their other definition[0]:

              > The aim of Open Source is not and has never been to enable reproducible software.
              ...
              > Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. 
              ...
              > Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently than its original status. Things that a fork may achieve are: fixing security issues, improving behavior, removing bias.
            
            To achieve those things, you do need exactly what most people are asking for: training details.

            So far companies are just releasing checkpoints and architectures. That's better than nothing, and it's a great step (especially given how entrenched these businesses are[1]). But if we really want to do things like fix security issues or remove bias, we have to be able to examine the data the model was originally trained on AND the training procedure. Both introduce biases (in the statistical sense, which is more general than the colloquial one). These issues can't all be solved by tuning, and how well tuning works is itself heavily influenced by those upstream decisions.

            The reason we care about reproducible builds is that they matter for things like security, where we want to know that what we're looking at is the same thing that's in the actual program. It's fair to say that the "aim" isn't reproducible software, but reproducibility is a direct consequence of software being open source. Trust matters, but the saying is "trust but verify". Sure, you can also fix vulnerabilities and bugs in closed source software; hell, you can even edit it or build on top of it. But we don't call that open source (or even source available) for a reason.

            If we're going to be consistent in our definitions, we need to understand what these things are at some minimal level of abstraction. And frankly, as an ML researcher, I just don't see it here.

            That said, I'm generally fine with "source available" and, like most people, use it synonymously with "open source". But if you're going to go around telling everyone they're wrong about the OSS definition, at least be consistent and stick to your own values.

            [0] https://opensource.org/osd

            [1] Businesses whose entire model depends on OSS (by the OSI's definition) and freely available research

            • ensignavenger 2 hours ago

              "Reproducible build" is a term used to refer to getting an exact binary match out of a build. This is outside the scope of the OSD. I am not certain, but it sounds like this is what they are talking about here. Just because you run the build yourself doesn't mean you will get an exact match of what the original producer built. Something as simple as a random number generator or using a timestamp in the build will result in a mismatch.

            • tananaev 4 hours ago

              The definition is good because many currently call their open-weight models open "source". But I suspect most companies will keep calling their models open source even when they're not.

              • datascientist 4 hours ago
                • blackeyeblitzar 3 hours ago

                  A reinforcement of definitions is needed. Open weights is NOT open source, yet companies like Meta are rampantly open-washing their work. The point of open source is that you can recreate the product yourself, for example by compiling the source code. The equivalent for an LLM is clearly being able to retrain the model to produce the weights. Yes, I realize this is impractical without access to the hardware, but the transparency is still important, so we know how these models are designed and how they may be influencing us through biases/censorship.

                  The only actually open source model I am aware of is AI2’s OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...), which includes training data, training code, evaluation code, fine-tuning code, etc.

                  The license also matters: a license encumbered with restrictions on what you can do with the software is not really open source.

                  I do have concerns about where OSI is going with all this. For example, why are they now saying that reproducibility is not part of the definition? The two paragraphs below contradict each other: what does it mean to be able to “meaningfully fork” something and make it more useful if you don’t have the ingredients to reproduce it in the first place?

                  > The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.

                  > Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone.

                  • MichaelNolan 3 hours ago

                    > what does it mean to be able to “meaningfully fork” something and be able to make it more useful if you don’t have the ingredients to reproduce it in the first place?

                    I could be misunderstanding them, but my takeaway is that exact bit-for-bit reproducibility is not required. Most software, including open source software, is not bit-for-bit reproducible; exact reproducibility is a fairly new goal. Even with all the training data and all the code, you are unlikely to get the exact same model as before.
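
                    As a toy illustration (numpy standing in for a real training stack; the data and model here are invented for the example), changing nothing but the data-shuffling seed already yields different final weights:

                      import numpy as np

                      def train(seed: int, X: np.ndarray, y: np.ndarray, epochs: int = 20) -> np.ndarray:
                          rng = np.random.default_rng(seed)
                          w = np.zeros(X.shape[1])
                          for _ in range(epochs):
                              # The only difference between runs: the order samples are visited in.
                              for i in rng.permutation(len(X)):
                                  w -= 0.01 * (X[i] @ w - y[i]) * X[i]  # SGD on squared error
                          return w

                      data_rng = np.random.default_rng(0)
                      X = data_rng.normal(size=(100, 5))
                      y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + data_rng.normal(scale=0.1, size=100)

                      w_a = train(seed=1, X=X, y=y)  # same data, same code,
                      w_b = train(seed=2, X=X, y=y)  # different shuffling seed

                      print("max |w_a - w_b|:", np.abs(w_a - w_b).max())     # small but nonzero
                      print("identical weights:", np.array_equal(w_a, w_b))  # False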

                    Though if that is what they mean, then they should be more explicit about it.

                  • glkanb 3 hours ago

                    Ok, decent first steps. Now approve a BSD license with an additional clause that prohibits use for "AI" training.

                    Just like a free grazing field would allow living animals, but not a combine harvester. The old rules of "for any purpose" no longer apply.

                    • exac 3 hours ago

                      > The aim of Open Source is not and has never been to enable reproducible software.

                      Okay, well, just because you have the domain name "opensource.org" doesn't mean you get to speak for the community, or for the community's understanding of the term.

                      opensource.org is irrelevant.

                      • saurik 3 hours ago

                        I mean, I've never understood "open source" to require reproducibility? The concept barely even existed as something people strove for until 15 years ago, a lot of software still only barely supports it, and there are tons of tradeoffs that come with it: you effectively inherit your entire toolchain as something you have to vendor and maintain, and a lot of projects end up with awkward binaries as a result, since almost no one reproduces entirely from a small bit of bootstrapped Lisp.

                        • FrustratedMonky 3 hours ago

                          I agree.

                          "never been to enable reproducible software"

                          I'd just say that "never" is a big word.

                          Having open code that everyone can read and run was partly meant to allow for reproducibility. In the closed world, how is anybody reproducing anything? Being open does enable that.

                          • saurik 3 hours ago

                            The article seems to cover this nuance in the next paragraphs?