• fxtentacle 4 hours ago

    The title is not wrong, but it doesn't feel quite right either. What they do here is use a pre-trained model to guide the training of a second model. Of course, that massively speeds up training of the second model. But it's not like you can now train a diffusion model from scratch 20x faster. Instead, this is a technique for transplanting an existing model onto a different architecture so that you don't have to start training from zero.
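
    Roughly, the recipe looks like this (a minimal sketch, assuming a REPA-style setup where an auxiliary loss pulls the new model's intermediate features toward a frozen pretrained encoder's features; all names, interfaces, and the loss weight are illustrative):

        import torch
        import torch.nn.functional as F

        def training_step(diffusion_model, frozen_encoder, projector, x0):
            # Forward diffusion: mix the clean image x0 with Gaussian noise
            # at a random level (a simple linear schedule, for illustration).
            noise = torch.randn_like(x0)
            t = torch.rand(x0.shape[0], 1, 1, 1)
            xt = (1 - t) * x0 + t * noise

            # Assumed interface: the model returns its noise prediction plus
            # an intermediate feature map from one of its blocks.
            pred_noise, hidden = diffusion_model(xt, t)
            denoise_loss = F.mse_loss(pred_noise, noise)

            # Alignment term: push those intermediate features toward the
            # frozen pretrained encoder's view of the clean image.
            with torch.no_grad():
                target = frozen_encoder(x0)   # e.g. DINOv2-style patch features
            align = F.cosine_similarity(projector(hidden), target, dim=-1)

            return denoise_loss - 0.5 * align.mean()   # 0.5 is illustrative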

    • byyoung3 3 hours ago

      Yes, in hindsight it seems obvious, but before this it wasn't clear that it would speed things up, since the pretrained model was trained on a different objective. It's a brilliant idea that works amazingly well.

      • zaptrem 4 hours ago

        Yeah, I wonder whether this still saves compute if you include the compute used to train DINOV2/whatever representation model you'd like to use?

      • GaggiX an hour ago

        I wonder how well this technique works when the training distributions of the diffusion model and the image encoder are quite different, for example if you use DinoV2 as the encoder but train the diffusion model on anime.

        • gdiamos 4 hours ago

          Still waiting for a competitive diffusion LLM

          • kleiba 3 hours ago

            Why?

            • WithinReason 3 hours ago

              Diffusion works significantly better for images than sequential pixel generation, so there is a good chance it would work better for language as well.

              Sequential generation was state of the art for images back in 2016, and it's basically how current LLMs work:

              https://arxiv.org/abs/1601.06759
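
              To make the contrast concrete, here is a toy sketch (the model interfaces are hypothetical): sequential generation commits to one element at a time and never revisits it, while diffusion refines every position over many steps:

                  import torch

                  def sample_autoregressive(model, length):
                      seq = []
                      for _ in range(length):
                          # One element per step; earlier choices are final.
                          seq.append(model.predict_next(seq))
                      return seq

                  def sample_diffusion(model, shape, steps=50):
                      x = torch.randn(shape)   # start from pure noise
                      for t in reversed(range(steps)):
                          # Every position gets refined at every step.
                          x = model.denoise(x, t)
                      return x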

              • kleiba 2 hours ago

                Neural LMs used to be based on recurrent architectures until the Transformer came along. That architecture is not recurrent.

                I am not sure that a diffusion approach is all that suitable for generating language. Words are much more discrete than pixels.
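
                One way to see the problem: a denoiser outputs continuous vectors, so embedding-space diffusion LMs (in the style of Diffusion-LM) have to snap the result back to the vocabulary somehow. A toy sketch of that rounding step, with illustrative sizes:

                    import torch

                    vocab_size, dim = 50_000, 256      # illustrative sizes
                    embed = torch.nn.Embedding(vocab_size, dim)

                    def round_to_tokens(x):
                        # x: (seq_len, dim) continuous vectors from the denoiser.
                        # Snap each position to its nearest token embedding.
                        dists = torch.cdist(x, embed.weight)
                        return dists.argmin(dim=-1)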

                • WithinReason 2 hours ago

                  I meant sequential generation, not using an RNN.

                  Diffusion doesn't work on pixels directly either; it works on a latent representation.

                  • kleiba an hour ago

                    All NNs work on latent representations.

                    • barrkel 32 minutes ago

                      The contrast here is real: there are pixel space diffusion models and latent space diffusion models. Pixel space diffusion is slower because there's more redundant information.
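
                      A sketch of that contrast (denoiser and decoder interfaces are hypothetical, shapes illustrative): the pixel-space loop pushes far more values through every step, while the latent-space loop decodes back to pixels only once:

                          import torch

                          def sample_pixels(denoise, steps=1000):
                              # Diffuse directly over ~200k pixel values.
                              x = torch.randn(3, 256, 256)
                              for t in reversed(range(steps)):
                                  x = denoise(x, t)
                              return x

                          def sample_latent(denoise, decode, steps=1000):
                              # Diffuse over ~4k latent values instead,
                              # then decode back to pixels once at the end.
                              z = torch.randn(4, 32, 32)
                              for t in reversed(range(steps)):
                                  z = denoise(z, t)
                              return decode(z)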