Comments Page - Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

« Back Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPUai.gopubby.comSubmitted by qiakai a year ago

Rzor a year ago
From the article: Please note: it’s not designed for real-time interactive scenarios like chatting, more suitable for data processing and other offline asynchronous scenarios. Repo: https://github.com/lyogavin/Anima/tree/main/air_llm
Gloomily3819 a year ago
What a misleading article. I thought they'd done some breakthrough in resource efficiency. This is just the old and slow method tools like Ollama used.
- logicallee a year ago
  Do you know how much disk space this takes total? When I ran it, it downloaded nearly 30 gigabytes of models and seemed to be on track to download 28 more 5 gigabyte chunks (for a total of 150 gigabytes of disk space or maybe more). What is the total size before it finishes?
  lostmsu a year ago
  70B parameters * 2 bytes each (fp16 or bf16) = 140GB
  I wish models sizes were published in bytes.
  logicallee a year ago
  Thanks, I finished downloading it (which took many hours) onto an external hard drive (by adding a HF_HOME environmental variable for where to store that cache). Its size was 262 GB.
- gavmor a year ago
  What method is that? Layer offloading?
  Hugsun a year ago
  Yes, it's either that, or CPU inference. The article doesn't say.
  It doesn't mention quantization either.
0cf8612b2e1e a year ago
Any sense of speed? My assumption is that shuttling the weights in/out of the GPU is slow. Does GPU load + processing beat an entirely CPU solution? Doubly so if it is a huge model where the model cannot sit fully in RAM?
- p1esk a year ago
  Depends on your CPU. I once tried 70b llama on 256 thread Epyc, it was around 1/10 of A100 (80GB) speed.
  logicallee a year ago
  how much disk space did it use?
  p1esk a year ago
  I didn’t check, but iirc it was an fp16 model checkpoint which we converted to int8 for inference, so I assume 140GB?
  logicallee a year ago
  Thanks, I finished downloading it (which took many hours) onto an external hard drive (by adding a HF_HOME environmental variable for where to store that cache). Its size was 262 GB.
999900000999 a year ago
Any chance that the new NPUs are going to significantly speed up running these locally.
Well I'm definitely worried about recall and all the Microsoft nonsense, I really want to be able to run and train LMMs, and other machine learning frameworks locally.
- irusensei a year ago
  You still need lots of fast memory.
Hugsun a year ago
Abysmal article. It doesn't explain anything about the claim in the title. Is there quantization? How much RAM do you need? How fast is the inference? None of these questions are addressed or even mentioned.
> Of course, it would be more reasonable to compare the similarly sized 400B models with GPT4 and Claude3 Opus
No. It's completely irrelevant to the topic of the article.
The article is mostly a press release for llama 3. It also contains a few comments by the author, they aren't bad but don't save the clickbaity, buzzy, sensationalist core.
bionhoward a year ago
Llama isn’t open source because the license says you can only use it to improve itself, so the title is false
- exe34 a year ago
  You could use it to earn money to spend on GPU to improve llama...
  undefined a year ago
  [deleted]
andrewmcwatters a year ago
This is probably going to sound silly, but I wonder how it compares to TinyLlama and others.
fexelein a year ago
As a cloud solution developer that has to build AI on Azure I have been using this instead of Azure OpenAI. It has sped up my development workflow a lot, and for my purposes it’s comparable enough. I’m using LM studio to load these models.
- isoprophlex a year ago
  Can you expand a bit -- because the AOAI is so slow? What exactly helps you speed things up?
  fexelein a year ago
  On my machine, I am able to create a prompt that suits my need and chat with the model in realtime. With 100% GPU offload, it replies within half a second. LM studio provides an OpenAI compatible api endpoint for my Dotnet software to use. This boosts my developer experience significantly. The Azure services are slow and if you want to regenerate a serie of responses (e.g part of conversation flow) it just takes too long. On my local machine I also do not worry about cloud costs.
  As a bonus; I also use this for a personal project where I use prompts and Llama3 to control smart devices. JSON responses from the LLM are parsed and translated into the smart device commands from a raspberry pi. I control it using speech via my Apple Watch and Apple shortcuts to the raspberry pi’s api. It all works magically and fast. Way faster than pulling up the app on my phone. And yes the LLM is smart enough to control groups of devices using simple conversational AI.
  edit; here's a demo https://www.youtube.com/watch?v=dCN1AnX8txM
kouru225 a year ago
is it possible to use this for audio transcription?
undefined a year ago
[deleted]
1GZ0 a year ago
This sounds like a game changer. I wonder if they need to do a tonne of specific work per model? If this could be implemented in Ollama, I'd be over the moon.
- nutrientharvest a year ago
  Ollama can already run Llama-3 70B with a 4GB GPU, or no GPU at all, it'll just be slow.
  Considering this says it's "not designed for real-time interactive scenarios" it's probably also really slow
  cpill a year ago
  so how much GPU RAM does need to get the 70B going fast (ish)?
  AaronFriel a year ago
  A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. This is convenient for the math: 70GB plus some overhead for the attention matrices (ongoing requests). This overhead depends on workload and context lengths, but you should expect about 30% more. So, around 100GB for a server under load.
- programd a year ago
  llama3:70b using llama.cpp (used under the hood by Ollama) on a 11th Gen Intel i5-11400 @ 2.60GHz - no GPU, CPU inference only.
  "Write a haiku about Hacker News mentioning AI in the title"
  Here is a haiku:
  AI whispers secrets HN threads weave tangled debate Intelligence born eval time = 30363.04 ms / 23 runs ( 1320.13 ms per token, 0.76 tokens per second) total time = 34294.80 ms / 33 tokens
  bityard a year ago
  That really doesn't seem bad. When people talk about responses of self-hosted LLMs without a beefy GPU being unusably slow, I always assumed they meant 15 minutes to hours. I do not mind waiting a few minutes if it will summarize the answer a question that will take me many times longer to research.
  logicallee a year ago
  how much disk space did it use?