• Rzor a year ago

    From the article: Please note: it’s not designed for real-time interactive scenarios like chatting, more suitable for data processing and other offline asynchronous scenarios. Repo: https://github.com/lyogavin/Anima/tree/main/air_llm
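
    Basic usage is a few lines of Python. This is from memory of the repo's README, so treat the exact class and argument names as approximate and check the repo for the current API:

      from airllm import AutoModel   # pip install airllm (API details may differ by version)

      # Layers are streamed through the GPU one at a time, so ~4GB of VRAM is enough,
      # but every generated token pays the cost of re-reading ~140GB of weights from disk.
      model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

      tokens = model.tokenizer(["Summarize the attached server log: ..."],
                               return_tensors="pt", truncation=True, max_length=128)
      out = model.generate(tokens["input_ids"].cuda(), max_new_tokens=64,
                           return_dict_in_generate=True)
      print(model.tokenizer.decode(out.sequences[0]))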

    • Gloomily3819 10 months ago

      What a misleading article. I thought they'd done some breakthrough in resource efficiency. This is just the old and slow method tools like Ollama used.

      • logicallee 10 months ago

        Do you know how much disk space this takes in total? When I ran it, it downloaded nearly 30 gigabytes of models and seemed to be on track to download 28 more 5-gigabyte chunks (for a total of 150 gigabytes of disk space, or maybe more). What is the total size once it finishes?

        • lostmsu 10 months ago

          70B parameters * 2 bytes each (fp16 or bf16) = 140GB

          I wish model sizes were published in bytes.
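
          Back-of-the-envelope in Python, weights only (ignores the KV cache and any tokenizer/config files):

            params = 70e9  # Llama-3 70B
            for name, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
                print(f"{name}: {params * bits / 8 / 1e9:.0f} GB")
            # fp16/bf16: 140 GB, int8: 70 GB, 4-bit: 35 GB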

          • logicallee 10 months ago

            Thanks, I finished downloading it (which took many hours) onto an external hard drive (by setting the HF_HOME environment variable to store the cache there). Its size was 262 GB.
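
            For anyone else trying this, redirecting the Hugging Face cache is just one environment variable; the path below is only an example:

              import os

              # Must be set before transformers/airllm are imported (or export it in your shell).
              os.environ["HF_HOME"] = "/mnt/external/huggingface"  # example mount point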

        • gavmor 10 months ago

          What method is that? Layer offloading?

          • Hugsun 10 months ago

            Yes, it's either that, or CPU inference. The article doesn't say.

            It doesn't mention quantization either.
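
            For what it's worth, the general idea of layer-by-layer offloading looks roughly like this generic PyTorch sketch (not this project's actual code; the shard files and sizes are illustrative):

              import torch

              def offloaded_forward(hidden, layer_shards, device="cuda"):
                  # A 70B model has ~80 decoder layers of roughly 1.75GB each in fp16;
                  # stream them through a small GPU one at a time.
                  for shard in layer_shards:                    # e.g. ["layer_00.pt", "layer_01.pt", ...]
                      layer = torch.load(shard, map_location="cpu")  # one layer's weights, as an nn.Module
                      layer.to(device)
                      with torch.no_grad():
                          hidden = layer(hidden)                # run just this layer on the GPU
                      layer.to("cpu")                           # free VRAM for the next layer
                      del layer
                      torch.cuda.empty_cache()
                  return hidden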

        • 0cf8612b2e1e 10 months ago

          Any sense of speed? My assumption is that shuttling the weights in and out of the GPU is slow. Does GPU loading + processing beat an entirely CPU solution? Doubly so if it's a huge model that cannot sit fully in RAM?

          • p1esk 10 months ago

            Depends on your CPU. I once tried a 70B Llama on a 256-thread Epyc; it ran at around 1/10 the speed of an A100 (80GB).

            • logicallee 10 months ago

              how much disk space did it use?

              • p1esk 10 months ago

                I didn’t check, but iirc it was an fp16 model checkpoint which we converted to int8 for inference, so I assume 140GB?

                • logicallee 10 months ago

                  Thanks, I finished downloading it (which took many hours) onto an external hard drive (by setting the HF_HOME environment variable to store the cache there). Its size was 262 GB.

          • 999900000999 10 months ago

            Any chance that the new NPUs are going to significantly speed up running these locally?

            Well, I'm definitely worried about Recall and all the Microsoft nonsense; I really want to be able to run and train LLMs and other machine learning models locally.

            • irusensei 10 months ago

              You still need lots of fast memory.

            • Hugsun 10 months ago

              Abysmal article. It doesn't explain anything about the claim in the title. Is there quantization? How much RAM do you need? How fast is the inference? None of these questions are addressed or even mentioned.

              > Of course, it would be more reasonable to compare the similarly sized 400B models with GPT4 and Claude3 Opus

              No. It's completely irrelevant to the topic of the article.

              The article is mostly a press release for Llama 3. It also contains a few comments by the author; they aren't bad, but they don't save the clickbaity, buzzy, sensationalist core.

              • bionhoward 10 months ago

                Llama isn’t open source because the license says you can only use it to improve itself, so the title is false

                • exe34 10 months ago

                  You could use it to earn money to spend on GPU to improve llama...

                  • undefined 10 months ago
                    [deleted]
                • andrewmcwatters 10 months ago

                  This is probably going to sound silly, but I wonder how it compares to TinyLlama and others.

                  • fexelein 10 months ago

                    As a cloud solution developer who has to build AI on Azure, I have been using this instead of Azure OpenAI. It has sped up my development workflow a lot, and for my purposes it's comparable enough. I'm using LM Studio to load these models.

                    • isoprophlex 10 months ago

                      Can you expand a bit -- is it because AOAI is so slow? What exactly helps you speed things up?

                      • fexelein 10 months ago

                        On my machine, I am able to create a prompt that suits my needs and chat with the model in real time. With 100% GPU offload, it replies within half a second. LM Studio provides an OpenAI-compatible API endpoint for my .NET software to use. This boosts my developer experience significantly. The Azure services are slow, and if you want to regenerate a series of responses (e.g. as part of a conversation flow) it just takes too long. On my local machine I also don't have to worry about cloud costs.

                        As a bonus, I also use this for a personal project where I use prompts and Llama 3 to control smart devices. JSON responses from the LLM are parsed and translated into smart device commands on a Raspberry Pi. I control it by speech via my Apple Watch, with Apple Shortcuts calling the Raspberry Pi's API. It all works magically and fast. Way faster than pulling up the app on my phone. And yes, the LLM is smart enough to control groups of devices using simple conversational AI.

                        edit: here's a demo https://www.youtube.com/watch?v=dCN1AnX8txM
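
                        If anyone wants to reproduce the setup, here's a rough Python equivalent of what my .NET client does. The port, model name and JSON command schema below are placeholders; LM Studio shows the actual values for your local server:

                          import json
                          from openai import OpenAI  # pip install openai

                          # LM Studio exposes an OpenAI-compatible local server.
                          client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

                          resp = client.chat.completions.create(
                              model="llama-3-8b-instruct",  # whichever model is loaded in LM Studio
                              messages=[
                                  {"role": "system",
                                   "content": "Reply only with JSON: {\"device\": \"...\", \"action\": \"on\" or \"off\"}"},
                                  {"role": "user", "content": "Turn off the living room lights"},
                              ],
                          )

                          command = json.loads(resp.choices[0].message.content)
                          print(command["device"], command["action"])  # hand this off to the smart-home API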

                    • kouru225 10 months ago

                      is it possible to use this for audio transcription?

                      • undefined 10 months ago
                        [deleted]
                        • 1GZ0 a year ago

                          This sounds like a game changer. I wonder if they need to do a tonne of specific work per model? If this could be implemented in Ollama, I'd be over the moon.

                          • nutrientharvest 10 months ago

                            Ollama can already run Llama-3 70B with a 4GB GPU, or no GPU at all; it'll just be slow.

                            Considering this says it's "not designed for real-time interactive scenarios", it's probably also really slow.

                            • cpill 10 months ago

                              So how much GPU RAM does it need to get the 70B going fast(ish)?

                              • AaronFriel 10 months ago

                                A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. That makes the math convenient: at 8 bits per weight, 70B parameters is about 70GB, plus some overhead for the attention state of ongoing requests (the KV cache). This overhead depends on workload and context lengths, but you should expect about 30% more. So, around 100GB for a server under load.
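
                                To put numbers on that (assuming Llama-3-70B's config of 80 layers and 8 KV heads of dimension 128; the 30% is the rule of thumb above):

                                  weights_gb = 70e9 * 8 / 8 / 1e9              # 8-bit weights: ~70 GB

                                  # fp16 KV cache per token: K and V, per layer, per KV head
                                  kv_bytes_per_token = 2 * 80 * 8 * 128 * 2    # ~0.33 MB per token
                                  kv_gb = kv_bytes_per_token * 8192 * 8 / 1e9  # e.g. 8k context x 8 concurrent requests
                                  print(f"{weights_gb:.0f} GB weights + {kv_gb:.0f} GB KV cache")  # ~70 + ~21 GB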

                            • programd 10 months ago

                              llama3:70b using llama.cpp (used under the hood by Ollama) on an 11th Gen Intel i5-11400 @ 2.60GHz - no GPU, CPU inference only.

                              "Write a haiku about Hacker News mentioning AI in the title"

                              Here is a haiku:

                                AI whispers secrets
                                HN threads weave tangled debate
                                Intelligence born
                              
                                eval time = 30363.04 ms / 23 runs ( 1320.13 ms per token, 0.76 tokens per second)
                                total time = 34294.80 ms / 33 tokens
                              • bityard 10 months ago

                                That really doesn't seem bad. When people talk about responses from self-hosted LLMs without a beefy GPU being unusably slow, I always assumed they meant 15 minutes to hours. I don't mind waiting a few minutes if it will summarize the answer to a question that would take me many times longer to research.

                                • logicallee 10 months ago

                                  how much disk space did it use?