• homarp 21 days ago

    Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.

    Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
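    A back-of-the-envelope check on those figures (this is a reader's sketch, weights only): bf16/fp16 store 2 bytes per parameter, so the model weights alone come out lower than the quoted totals, with the remainder going to activations, the KV cache, and framework overhead.

    ```python
    # Rough VRAM estimate for the weights alone: parameter count x bytes per element.
    # bf16 and fp16 both use 2 bytes per parameter. The ~9.5 GB / ~55 GB figures
    # quoted above are higher because inference also needs activations and KV cache.

    def weight_vram_gb(num_params: float, bytes_per_param: int = 2) -> float:
        """Return weight memory in decimal GB for a given parameter count."""
        return num_params * bytes_per_param / 1e9

    print(weight_vram_gb(3e9))   # Voxtral-Mini-3B:   6.0 GB of weights
    print(weight_vram_gb(24e9))  # Voxtral-Small-24B: 48.0 GB of weights
    ```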

    • GaggiX 21 days ago

      There is also the Voxtral Small 24B model available to download: https://huggingface.co/mistralai/Voxtral-Small-24B-2507

      • ipsum2 21 days ago

        24B is crazy expensive for speech transcription. Conspicuously, there's no comparison with Parakeet, a 600M-param model that's currently dominating leaderboards (but only for English).

        • azinman2 21 days ago

          But it also includes world knowledge, can do tool calls, etc. It's an omni model.

          • qwertox 20 days ago

            Only the Mini is meant for pure transcription. And in the tests I just ran on their API, compared to Whisper large it's around three times faster, more accurate, and cheaper.

            24B is, as sibling comment says, an omni model, it can also do function calling.

          • kamranjon 21 days ago

            I'm pretty excited to play around with this. I've worked with Whisper quite a bit, and it's awesome to have another model in the same class, from Mistral, who tend to be very open. I'm sure unsloth is already working on some GGUF quants - will probably spin it up tomorrow and try it on some audio.

            • lostmsu 21 days ago

              Does it support realtime transcription? What's the approximate latency?

              • rolisz 21 days ago

                Unlikely. The small model is much larger than Whisper (which is already hard to use for realtime).

              • vivalapomy 16 days ago

                Won't comment on the 24B model as I see no personal use for it, but for pure ASR tasks I honestly can't see Voxtral taking off. For personal use I've been running a quant of Whisper tiny (for English) and of Whisper small (for Spanish, my native language), and have never hit major latency with globally available voice commands. Given that my machine runs an Ivy Bridge processor with CPU inference, the pricing seems unreasonable.

                • sheerun 21 days ago

                  In the demo they mention Polish pronunciation is pretty bad - spoken as if it were the second language of a native English speaker. I wonder if it's the same for other languages. On the other hand, whispering English is hilariously good, especially the different emotions.

                  • Raed667 21 days ago

                    It is insane how good the "French man speaking English" demo is. It captures a lot of subtleties.

                    • potlee 20 days ago

                      That’s an actual French man speaking English

                  • lostmsu 21 days ago

                    My Whisper v3 Large Turbo is $0.001/min, so their price comparison is not exactly perfect.

                    • ImageXav 21 days ago

                      How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.

                      • lostmsu 21 days ago

                        Harvesting idle compute. https://borgcloud.org/speech-to-text

                        • BetterWhisper 21 days ago

                          Do you support speaker recognition?

                          • lostmsu 21 days ago

                            No. I've found models that do that to be unreliable when there are many speakers.

                          • 4b11b4 21 days ago

                            This is your service?

                            • lostmsu 21 days ago

                              Yes

                      • danelski 22 days ago

                        They claim to undercut competitors of similar quality by half for both models, yet they released both under Apache 2.0 instead of following the smaller-open, larger-closed strategy of their recent releases. What's different here?

                        • halJordan 21 days ago

                          They didn't release a Voxtral Large, so your question doesn't really make sense.

                          • danelski 21 days ago

                            It's about whatever their top offering is at the moment, not about having "Large" in the name. Mistral Medium 3 is notably not Mistral Large 3, yet it was released as API-only.

                          • wmf 21 days ago

                            They're working on a bunch of features so maybe those will be closed. I guess they're feeling generous on the base model.

                            • Havoc 21 days ago

                              Probably not looking to directly compete in transcription space