• Rzor a year ago

    From the article: Please note: it’s not designed for real-time interactive scenarios like chatting, more suitable for data processing and other offline asynchronous scenarios. Repo: https://github.com/lyogavin/Anima/tree/main/air_llm

    • Gloomily3819 a year ago

      What a misleading article. I thought they'd made some breakthrough in resource efficiency. This is just the old, slow method that tools like Ollama already use.

      • logicallee a year ago

        Do you know how much disk space this takes total? When I ran it, it downloaded nearly 30 gigabytes of models and seemed to be on track to download 28 more 5 gigabyte chunks (for a total of 150 gigabytes of disk space or maybe more). What is the total size before it finishes?

        • lostmsu a year ago

          70B parameters * 2 bytes each (fp16 or bf16) = 140GB

          I wish model sizes were published in bytes.
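
          A quick back-of-the-envelope version of that, as a sketch (weights only; real checkpoints ship tokenizer files and sometimes multiple weight formats, so the actual download can be larger):

            def weights_size_gb(params_billions, bytes_per_param):
                # weights only: parameters x bytes per parameter, in GB (1e9 bytes)
                return params_billions * bytes_per_param

            for fmt, b in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
                print(f"{fmt}: {weights_size_gb(70, b):.0f} GB")  # 140, 70, 35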

          • logicallee a year ago

            Thanks, I finished downloading it (which took many hours) onto an external hard drive (by setting an HF_HOME environment variable to choose where that cache is stored). Its size was 262 GB.
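
            For anyone else doing this, a minimal sketch of redirecting the Hugging Face cache before anything is downloaded (the path and model id below are just examples):

              import os

              # Must be set before any Hugging Face library is imported, otherwise
              # the default ~/.cache/huggingface location is picked up.
              os.environ["HF_HOME"] = "/mnt/external_drive/hf_cache"  # example path

              from huggingface_hub import snapshot_download

              # Example model id; the 70B repos are gated, so this also needs an
              # accepted license and a token (huggingface-cli login).
              snapshot_download("meta-llama/Meta-Llama-3-70B-Instruct")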

        • gavmor a year ago

          What method is that? Layer offloading?

          • Hugsun a year ago

            Yes, it's either that, or CPU inference. The article doesn't say.

            It doesn't mention quantization either.
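
            For the curious, the general layer-offloading idea looks roughly like the sketch below; this only illustrates the technique, not necessarily what AirLLM does internally:

              import torch

              @torch.no_grad()
              def offloaded_forward(layers, hidden, device="cuda"):
                  # Stream one layer at a time through the GPU so only a single
                  # layer's weights are resident in VRAM at any moment.
                  for layer in layers:
                      layer.to(device)                   # copy this layer's weights to the GPU
                      hidden = layer(hidden.to(device))  # run just this layer
                      layer.to("cpu")                    # evict it before loading the next one
                  return hidden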

        • 0cf8612b2e1e a year ago

          Any sense of speed? My assumption is that shuttling the weights in/out of the GPU is slow. Does GPU loading + processing beat a CPU-only solution? Doubly so if it's a huge model that can't fully sit in RAM?

          • p1esk a year ago

            Depends on your CPU. I once tried a 70B Llama on a 256-thread Epyc; it was around 1/10 the speed of an A100 (80GB).

            • logicallee a year ago

              how much disk space did it use?

              • p1esk a year ago

                I didn’t check, but iirc it was an fp16 model checkpoint which we converted to int8 for inference, so I assume 140GB?

                • logicallee a year ago

                  Thanks, I finished downloading it (which took many hours) onto an external hard drive (by setting an HF_HOME environment variable to choose where that cache is stored). Its size was 262 GB.

          • 999900000999 a year ago

            Any chance the new NPUs are going to significantly speed up running these locally?

            While I'm definitely worried about Recall and all the Microsoft nonsense, I really want to be able to run and train LLMs and other machine learning models locally.

            • irusensei a year ago

              You still need lots of fast memory.

            • Hugsun a year ago

              Abysmal article. It doesn't explain anything about the claim in the title. Is there quantization? How much RAM do you need? How fast is the inference? None of these questions are addressed or even mentioned.

              > Of course, it would be more reasonable to compare the similarly sized 400B models with GPT4 and Claude3 Opus

              No. It's completely irrelevant to the topic of the article.

              The article is mostly a press release for llama 3. It also contains a few comments by the author, they aren't bad but don't save the clickbaity, buzzy, sensationalist core.

              • bionhoward a year ago

                Llama isn't open source because the license says you can only use it to improve itself, so the title is false.

                • exe34 a year ago

                  You could use it to earn money to spend on GPU to improve llama...

                  • undefined a year ago
                    [deleted]
                • andrewmcwatters a year ago

                  This is probably going to sound silly, but I wonder how it compares to TinyLlama and others.

                  • fexelein a year ago

                    As a cloud solution developer who has to build AI on Azure, I have been using this instead of Azure OpenAI. It has sped up my development workflow a lot, and for my purposes it's comparable enough. I'm using LM Studio to load these models.

                    • isoprophlex a year ago

                      Can you expand a bit -- is it because AOAI is so slow? What exactly helps you speed things up?

                      • fexelein a year ago

                        On my machine, I am able to create a prompt that suits my needs and chat with the model in real time. With 100% GPU offload, it replies within half a second. LM Studio provides an OpenAI-compatible API endpoint for my .NET software to use. This boosts my developer experience significantly. The Azure services are slow, and if you want to regenerate a series of responses (e.g. as part of a conversation flow) it just takes too long. On my local machine I also do not worry about cloud costs.

                        As a bonus, I also use this for a personal project where I use prompts and Llama3 to control smart devices. JSON responses from the LLM are parsed and translated into smart device commands on a Raspberry Pi. I control it using speech via my Apple Watch and Apple Shortcuts, which call the Raspberry Pi's API. It all works magically and fast. Way faster than pulling up the app on my phone. And yes, the LLM is smart enough to control groups of devices using simple conversational AI.
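
                        For illustration, a minimal sketch of hitting an OpenAI-compatible local endpoint such as LM Studio's, here via the Python client; the port and model name depend on the local setup:

                          from openai import OpenAI

                          # LM Studio's local server speaks the OpenAI API; the key can be any string.
                          client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

                          resp = client.chat.completions.create(
                              model="local-model",  # whichever model is currently loaded
                              messages=[{"role": "user", "content": "Turn off the living room lights. Reply with a JSON command."}],
                          )
                          print(resp.choices[0].message.content)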

                        edit; here's a demo https://www.youtube.com/watch?v=dCN1AnX8txM

                    • kouru225 a year ago

                      is it possible to use this for audio transcription?

                      • undefined a year ago
                        [deleted]
                        • 1GZ0 a year ago

                          This sounds like a game changer. I wonder if they need to do a tonne of specific work per model? If this could be implemented in Ollama, I'd be over the moon.

                          • nutrientharvest a year ago

                            Ollama can already run Llama-3 70B with a 4GB GPU, or no GPU at all; it'll just be slow.

                            Considering this says it's "not designed for real-time interactive scenarios", it's probably also really slow.
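
                            For anyone who wants to try that, a minimal sketch using the ollama Python client, assuming a local Ollama install; on CPU, expect on the order of a token per second for the 70B:

                              import ollama  # assumes the ollama Python client and a running Ollama server

                              # Ollama pulls the model on first use (the default llama3:70b quant is roughly 40GB on disk).
                              reply = ollama.chat(
                                  model="llama3:70b",
                                  messages=[{"role": "user", "content": "Say hello in five words."}],
                              )
                              print(reply["message"]["content"])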

                            • cpill a year ago

                              So how much GPU RAM does it need to get the 70B going fast(ish)?

                              • AaronFriel a year ago

                                A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. That makes the math convenient: roughly 70GB for the weights at 8 bits per weight, plus some overhead for the attention (KV) caches of ongoing requests. That overhead depends on workload and context lengths, but you should expect about 30% more. So, around 100GB for a server under load.
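
                                The same arithmetic as a tiny sketch; the 30% overhead figure is the rule of thumb above, not a measurement:

                                  def serving_mem_gb(params_billion=70, bits_per_weight=8, kv_overhead=0.30):
                                      # weights (params * bits / 8 bytes) plus ~30% for KV caches under load
                                      weights_gb = params_billion * bits_per_weight / 8
                                      return weights_gb * (1 + kv_overhead)

                                  print(serving_mem_gb())                    # 8-bit: ~91 GB
                                  print(serving_mem_gb(bits_per_weight=6))   # 6-bit: ~68 GB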

                            • programd a year ago

                              llama3:70b using llama.cpp (used under the hood by Ollama) on an 11th Gen Intel i5-11400 @ 2.60GHz - no GPU, CPU inference only.

                              "Write a haiku about Hacker News mentioning AI in the title"

                              Here is a haiku:

                                AI whispers secrets
                                HN threads weave tangled debate
                                Intelligence born
                              
                                eval time = 30363.04 ms / 23 runs ( 1320.13 ms per token, 0.76 tokens per second)
                                total time = 34294.80 ms / 33 tokens

                              • bityard a year ago

                                That really doesn't seem bad. When people talk about responses from self-hosted LLMs without a beefy GPU being unusably slow, I always assumed they meant 15 minutes to hours. I do not mind waiting a few minutes if it will summarize the answer to a question that would take me many times longer to research.

                                • logicallee a year ago

                                  how much disk space did it use?