• ks2048 12 hours ago

    Odd that the page doesn't seem to link to either:

    paper: https://arxiv.org/abs/2502.04128

    github: https://github.com/zhenye234/LLaSA_training

    • thot_experiment 9 hours ago

      Interesting that there isn't a mention of Orpheus as prior art either since it's the exact same thing.

      (https://github.com/canopyai/Orpheus-TTS)

      • gapeleon 5 hours ago

        > Interesting that there isn't a mention of Orpheus as prior art either

        Llasa-3b (https://huggingface.co/HKUSTAudio/Llasa-3B) came out before Orpheus (https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).

        > it's the exact same thing.

        They're very similar, but they're not the exact same thing.

        Llasa uses xcodec2, a much simpler, lossless 16khz wav codec. This makes it superior for one-shot voice cloning.

        Orpheus' 24khz snac codec is lossy, which makes it difficult to use for zero-shot cloning as the reference audio gets degraded during tokenization. You can test this here: https://huggingface.co/spaces/Gapeleon/snac_test

        But when finetuned on 50+ audio samples, it produces much cleaner 24khz audio than Llasa, and the snac model is much easier to run on consumer hardware than xcodec2 (87 tok/s for realtime speech, achievable on an RTX 3080, for example).
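
        A quick round-trip sketch of what that Space demonstrates, assuming the snac and torchaudio Python packages and the hubertsiuzdak/snac_24khz checkpoint (just one way to hear what tokenization does to a reference clip, not code from either project):

            import torch
            import torchaudio
            from snac import SNAC

            # load the 24 kHz SNAC codec used by Orpheus-style models
            model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

            wav, sr = torchaudio.load("reference.wav")            # any reference clip
            wav = torchaudio.functional.resample(wav, sr, 24000)  # codec expects 24 kHz
            wav = wav.mean(dim=0, keepdim=True).unsqueeze(0)      # (batch, channel, time), mono

            with torch.inference_mode():
                codes = model.encode(wav)    # hierarchical codebook indices (the "tokens")
                recon = model.decode(codes)  # reconstruct audio from those tokens

            torchaudio.save("reference_roundtrip.wav", recon.squeeze(0), 24000)
            # comparing reference.wav and reference_roundtrip.wav makes the
            # degradation of the cloning reference audible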

        • oezi an hour ago

          Do you happen to know why Orpheus and Llasa use finetuning for voice cloning?

          Zonos uses 128-float embeddings for voices, which seems so much nicer because you can just mix and match voices without changing the model.
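
          Purely as an illustration of the mix-and-match idea (this is not Zonos's actual API, just a sketch with hypothetical 128-float speaker vectors):

              import numpy as np

              emb_a = np.random.randn(128).astype(np.float32)  # stand-in for speaker A's embedding
              emb_b = np.random.randn(128).astype(np.float32)  # stand-in for speaker B's embedding

              alpha = 0.3                                      # 30% speaker A, 70% speaker B
              blended = alpha * emb_a + (1 - alpha) * emb_b
              blended /= np.linalg.norm(blended)               # renormalize before conditioning

          The model weights never change; only the conditioning vector does, which is why no per-voice finetuning is needed.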

          • oezi an hour ago

            Isn't xcodec2 also lossy? I thought it was also just another neural codec (50 tok/s, single codebook).

            What are people using to upsample back to 44.1 or 48 kHz? Anything fancy?
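
            (The non-fancy baseline would presumably be plain sinc resampling, e.g. with torchaudio as below; that can't add back anything above the original Nyquist, so "fancy" would mean a bandwidth-extension / audio super-resolution model rather than a resampler.)

                import torchaudio

                wav, sr = torchaudio.load("tts_output_16khz.wav")
                up = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=48000)
                torchaudio.save("tts_output_48khz.wav", up, 48000)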

      • CalmStorm 15 hours ago

        LLaSA is a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.
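
        Roughly, "fully align with standard LLMs" means the codec's discrete codes are appended to the text vocabulary, so one decoder-only Transformer predicts them token by token. A toy sketch of that ID mapping (sizes are illustrative, not taken from the paper):

            TEXT_VOCAB = 128_000   # e.g. a LLaMA-style tokenizer's vocabulary size
            CODEC_CODES = 65_536   # single-codebook VQ codec

            def speech_token_id(code: int) -> int:
                # codec code k becomes ordinary token id TEXT_VOCAB + k
                return TEXT_VOCAB + code

            def codec_code(token_id: int) -> int:
                assert token_id >= TEXT_VOCAB, "not a speech token"
                return token_id - TEXT_VOCAB

        Generation is then plain autoregressive decoding: text tokens in, speech tokens out, with the codec decoder turning the codes back into a waveform.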

        • WastedCucumber 14 hours ago

          Probably the title should have the correct capitalization then. Cause I was fully expecting a speech synthesis tool that sounded like llamas talking human language and now I'm bummed out!

        • mring33621 13 hours ago

          the long 'uuuuhhhhhhh' from some of the lesser models is killing me.

          • gapeleon 5 hours ago

            This finetune seems pretty stable (1b llasa) https://huggingface.co/spaces/HKUST-Audio/Llasa-1B-multi-spe...

            1B is actually huge for a TTS model. Here's an 82m model with probably the most stable/coherent output of all the open weights tts models I've tested: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

            But if you mean zero-shot cloning, yeah they all seem to have those slurred speech artefacts from time to time.

            • jszymborski 13 hours ago

                based on the samples, it really seems like anything smaller than 3B is pretty useless.

              • hadlock 12 hours ago

                If you're doing a home lab voice assistant, 1B is nice, because on a 12GB GPU you can run a moderately competent 7B LLM and two 1B models: one for speech-to-text and one for text-to-speech, plus a bit extra for the wake word monitor. Maybe in a couple of years we can combine all this into a single ~8B model that runs efficiently on a 12GB GPU. Nvidia doesn't seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer-grade chip when they're making so much money selling commercial-grade 48GB cards.
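
                Rough VRAM arithmetic for that setup (back-of-envelope assumptions, not measured numbers):

                    7B LLM at ~4-bit quantization : 7e9 * 0.5 bytes ≈ 3.5 GB (+ KV cache)
                    1B speech-to-text at fp16     : 1e9 * 2 bytes   ≈ 2.0 GB
                    1B text-to-speech at fp16     : 1e9 * 2 bytes   ≈ 2.0 GB
                    wake word model               : tens of MB, negligible
                    total                         : ≈ 8 GB of a 12 GB card, leaving headroom
                                                    for context and audio buffers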

            • StevenNunez 14 hours ago

              I can't wait to see this integrated into Open WebUI! These sound amazing.

              • gapeleon 5 hours ago

                You can run an OpenAI-compatible endpoint and point open-webui at it if you want this. I had to add a function to filter out markdown lists, code, etc., as the model was choking on them.
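
                Something along these lines works (an illustrative sketch, not the exact function):

                    import re

                    def strip_markdown_for_tts(text: str) -> str:
                        # drop fenced code blocks and inline code entirely
                        text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
                        text = re.sub(r"`[^`]*`", "", text)
                        # drop list/quote markers and heading hashes, keep the sentence text
                        text = re.sub(r"^\s*(?:[-*+]|\d+\.|>)\s+", "", text, flags=re.MULTILINE)
                        text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
                        # unwrap emphasis, keep link text and drop the URL
                        text = re.sub(r"[*_]{1,3}([^*_]+)[*_]{1,3}", r"\1", text)
                        text = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r"\1", text)
                        return re.sub(r"\n{3,}", "\n\n", text).strip()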

              • dheera 12 hours ago

                > employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align

                I really wish that when new models were released they would draw a diagram of all the layers and the tensor input and output sizes at each layer, with zoom in/out capabilities using D3.js or whatever visualization framework, if needed. Every single layer should be on there with its input and output sizes.

                These one-sentence descriptions, and approximate block diagrams with arrows pointing at each other are never enough to understand how something is actually implemented.
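
                Until model cards ship that, the closest cheap substitute is dumping per-layer shapes yourself, e.g. with PyTorch forward hooks (a generic sketch, not tied to LLaSA):

                    import torch
                    import torch.nn as nn

                    def print_shapes(model: nn.Module, *example_inputs):
                        hooks = []
                        def make_hook(name):
                            def hook(module, inputs, output):
                                ins = [tuple(t.shape) for t in inputs if torch.is_tensor(t)]
                                out = tuple(output.shape) if torch.is_tensor(output) else type(output).__name__
                                print(f"{name:50s} in={ins} out={out}")
                            return hook
                        for name, module in model.named_modules():
                            if name:  # skip the root module itself
                                hooks.append(module.register_forward_hook(make_hook(name)))
                        with torch.inference_mode():
                            model(*example_inputs)
                        for h in hooks:
                            h.remove()

                    # e.g. print_shapes(model, torch.randint(0, 32000, (1, 16)))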

                • dr_kiszonka 5 hours ago

                  That might be intentional.

                  • exe34 11 hours ago

                    Sounds like a solid SaaS business plan!