Llama 3.1 Omni Model (github.com), submitted by taikon 10 months ago
  • londons_explore 10 months ago

    Can this play sounds that can't be represented in text? Ie. "make the noise a chicken makes"

    • OJFord 10 months ago

      Bwaaaaaak bwakbwakbwak

      • varispeed 10 months ago

        I see you like dubstep

        • JodieBenitez 10 months ago

          Your username... a person of taste, for sure.

      • evilduck 10 months ago

        If it can create sounds associated with any nonphonetic word spellings, I can't see why it would struggle with onomatopoeia.

        • oezi 10 months ago

          Can it understand those sounds? And distinguish correct and incorrect pronunciation of words and accents?

          • DrSiemer 10 months ago

            Almost certainly not. Sounds like an old school vocoder, made to produce human speech and nothing else.

            • 8note 10 months ago

              As in, a cluck?

              But can it both say the word cluck, and make a clucking sound?

              • hansenliang 10 months ago

                asking the real questions

                • indigodaddy 10 months ago

                  Very interesting question actually

              • twoodfin 10 months ago

                I’m not clear on the virtues or potential of a model like this over a pure text model using STT/TTS to achieve similar results.

                Is the idea that as these models grow in sophistication they can properly interpret (or produce) inflection, cadence, emotion that’s lost in TTS?

                • a2128 10 months ago

                  There's a lot of data loss and guessing with STT/TTS.

                  An STT model might misrecognize a word, but an audio LLM may recover the true word from the broader context. A TTS model has to guess the inflection and can get it completely wrong, but an audio LLM could understand how to talk naturally and with what tone (e.g. use a higher tone if it's interjecting).

                  Speaking of interjection, an STT/TTS system will never interject because it relies on VAD and heuristics to guess when to start or stop talking, and generally the rule is to only talk after the user has stopped talking. An audio LLM could learn how to converse naturally, avoid taking up too much conversation time, or even talk with a group of people.

                  An audio LLM could also produce music or sounds, or tell you what a song is when you hum it. There are a lot of new possibilities.

                  I say "could learn" for most of this because it requires good training data, but from my understanding most of these are currently just trained on normal text datasets synthetically turned into voice with TTS, so they are effectively no better than a normal STT/TTS system; it's a good way to prove an architecture, but it doesn't demonstrate the full capabilities.

                  • langcss 10 months ago

                    You need a lot more power. I found gpt4o struggles with basic OCR of printed text by hallucinating a lot, while the Tesseract engine (old skool) gets it perfect. You need the model to be powerful enough to do everything.

                    You can work around this, by the way, by sending the output through a checking stage.

                    So: picture -> gpt4o -> out1, picture -> tesseract -> out2, (out1, out2) -> llm.

                    Might work for sound too.
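
                    Roughly, in code (just a sketch; pytesseract and the OpenAI API are example choices, and the prompts are made up):

                        import base64
                        import pytesseract
                        from PIL import Image
                        from openai import OpenAI

                        client = OpenAI()

                        def ocr_with_cross_check(path: str) -> str:
                            # picture -> tesseract -> out2
                            out2 = pytesseract.image_to_string(Image.open(path))

                            # picture -> gpt4o -> out1
                            with open(path, "rb") as f:
                                b64 = base64.b64encode(f.read()).decode()
                            out1 = client.chat.completions.create(
                                model="gpt-4o",
                                messages=[{"role": "user", "content": [
                                    {"type": "text", "text": "Transcribe the text in this image exactly."},
                                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                                ]}],
                            ).choices[0].message.content

                            # out1, out2 -> llm: reconcile the two passes into one transcription
                            return client.chat.completions.create(
                                model="gpt-4o",
                                messages=[{"role": "user", "content":
                                    "Two OCR passes of the same image may disagree in places.\n"
                                    f"Pass A:\n{out1}\n\nPass B:\n{out2}\n\n"
                                    "Merge them into the most likely correct transcription."}],
                            ).choices[0].message.content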

                    • falcor84 10 months ago

                      Interesting. I've actually been using gpt4o extensively for OCR and haven't encountered any significant issues - could you please give an example of an image of (otherwise legible) text that it hallucinates on?

                      • langcss 10 months ago

                        I'll send you an email.

                        • schrodinger 10 months ago

                          Same, it's perfect at OCR. Generating an image with text in it however… nope!

                        • killerstorm 10 months ago

                          Speech is inherently easier to represent as a sequence of tokens than a high-resolution image.

                          The best speech-to-text systems are already transformer-based neural networks anyway, so in theory a combined model can only be better.

                      • spuz 10 months ago

                        Personally, I'm very much looking forward to using a speech model like OpenAI's advanced voice mode for language learning. It can already do things like speak quickly or slowly, which traditional TTS systems can't. Also, in theory a speech model could tell me whether my pronunciation is accurate; it could correct me by repeating my incorrect pronunciation and then providing the correct one. I don't actually know how capable OpenAI's advanced voice mode is in this regard, because I haven't seen anyone actually test this, but I'm extremely curious to try it myself. If other voice models can achieve this, it will be an incredible tool for language learning.

                        • paulryanrogers 10 months ago

                          Traditional TTS can certainly be cranked up in speed. Low/no vision users often listen at 2-3x.

                        • theptip 10 months ago

                          Lots has been written on this subject; check out OpenAI's papers on GPT-4o, for example.

                          Latency is a big one, due to batching. You can't really interrupt the agent, which makes actual conversation clunkier. And yes, multimodal has better understanding. (I haven't seen analysis of perception of emotions; has anyone seen analysis of this capability for GPT-4o?)

                          • Reubend 10 months ago

                            Essentially, there's data loss going from audio to text. Sometimes that loss is unimportant, but sometimes keeping the audio meaningfully improves output quality.

                            However, there are some other potential fringe benefits here: lower reply latency, better speaker diarization, and reacting better to pauses in conversation.

                            • fragmede 10 months ago

                              Really

                              Yeah, that's the point. Without punctuation, no one can tell what inflection my "really" above should have, but even if it'd been "Really?" or "Really!", there's still room for interpretation. With a bet on voice interfaces needing a Google moment (wherein, prior to Google, search was crap) to truly become successful (by interpreting and creating inflection, cadence, and emotion, as you mentioned), creating such a model makes a lot of sense.

                              • bubaumba 10 months ago

                                > I’m not clear on the virtues or potential of a model like this over a pure text model

                                You can't put a pure text model with a keyboard on a robot; it would just become a wheeled computer.

                                Actually, this is a cool thing as a companion/assistant.

                              • dingdingdang 10 months ago

                                Do any of the model runners support this? Ollama, LM Studio, llama.cpp?

                                • cvzakharchenko 10 months ago

                                  So it's not STT -> LLM -> TTS? If I scream Chewbacca noises as input, will the model recognize it as nonsense, or will it interpret it with some lousy STT as some random words?

                                  • a2128 10 months ago

                                    It's not, but it probably won't recognize it as nonsense. According to the paper,

                                    > we construct a dataset named InstructS2S-200K by rewriting existing text instruction data and performing speech synthesis

                                     It has only been trained on questions spoken by TTS; it has never seen (heard) nonsense. Most likely it'll just hallucinate that you asked some question and generate an answer instead of asking if you're OK. There just aren't many audio datasets with real voices; there's no audio version of StackOverflow to be scraped.
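
                                     For a sense of what that synthetic construction looks like, here's a rough sketch (my own guess at the shape of it, not the paper's actual pipeline; Coqui TTS and the file names are assumptions):

                                         import json
                                         from pathlib import Path
                                         from TTS.api import TTS   # Coqui TTS as an example synthesiser

                                         tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
                                         out_dir = Path("speech_instructions")
                                         out_dir.mkdir(exist_ok=True)

                                         # Existing *text* instruction data: [{"instruction": ..., "response": ...}, ...]
                                         pairs = json.loads(Path("text_instructions.json").read_text())

                                         manifest = []
                                         for i, pair in enumerate(pairs):
                                             wav = out_dir / f"{i:06d}.wav"
                                             # Every training question is read out by the same TTS voice, so the model
                                             # never hears accents, disfluencies, or nonsense audio during training.
                                             tts.tts_to_file(text=pair["instruction"], file_path=str(wav))
                                             manifest.append({"audio": str(wav), "response": pair["response"]})

                                         Path("instructs2s_manifest.json").write_text(json.dumps(manifest, indent=2))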

                                    • schrodinger 10 months ago

                                       What about every movie that's been made? Although it might need to stick to those more than 100 years old to avoid copyright law?

                                    • vintermann 10 months ago

                                       I used to have fun with that. Set Google Translate to Chinese (or some other language I don't speak, though tonal languages seemed to work better), make some vague noises into it, and get out coherent but crazy phrases in English.

                                    • LorenDB 10 months ago

                                      The TTS voice in the demo clip sounds remarkably like Ellen McLain (Valve voice actor).

                                      https://en.m.wikipedia.org/wiki/Ellen_McLain

                                      • spencerchubb 10 months ago

                                          Sounds like it's trained on the LJ Speech dataset, which is one of the best datasets and very commonly used.

                                      • nickthegreek 10 months ago

                                          The speed looks very nice. I just recently set up LMStudio + AnythingLLM to try out local voice chat, and it's still a little slower than I'd like, but the PiperTTS voices are nicer than this.

                                        • aussieguy1234 10 months ago

                                            Not bad for 3 days of training. The voice output quality needs some work; it will be interesting to see what effect more training has.

                                          • cuuupid 10 months ago

                                              I wish there were training or fine-tuning code, as fine-tuning voices seems like a key requirement for any commercial use.

                                            • drcongo 10 months ago

                                              Am I the only one who trusts a GitHub repo much less when it has one of those stupid star history graphs on the readme?

                                              • rafram 10 months ago

                                                That’s silly. People are allowed to be proud of their work.

                                                • drcongo 10 months ago

                                                  Thanks for the counterpoint, guess I'm just an old cynic.

                                              • opdahl 10 months ago

                                                Any demos showcasing its performance?

                                                • potatoman22 10 months ago
                                                  • opdahl 10 months ago

                                                    Thank you.

                                                    Obviously it doesn’t sound human but that’s extremely impressive for an 8B model. Compared to the Moshi model also on the front page now, this model seems to be more coherent, but maybe less conversational?

                                                  • twobitshifter 10 months ago

                                                    There is a demo video on the page

                                                  • barrenko 10 months ago

                                                    I assume one can't finetune this further?
