• vessenes an hour ago

    This is not a memory-reduction technique that's somehow magical. Well, it does manage memory with some clever scheduling. The core idea is that you can schedule out inference on edge nodes in a memory- and bandwidth-optimized way that's a bit different from just splitting layers.

    They propose that right now computation and latency dominate the costs for multi-node inference, and pick a network topology (star) that is savvy to that.
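
    A toy way to see the topology point (my sketch, not the paper's actual scheduler): in a star, the hub only ever ships small activation vectors to each spoke, while the big weight shards stay resident on the devices, so per-link traffic stays tiny.

        import numpy as np

        # Toy star-topology tensor parallelism: each spoke holds a column
        # shard of one weight matrix; the hub sends activations out and
        # concatenates the partial results. Shapes are made up and there
        # is no real network here.
        d_model, n_spokes = 4096, 8
        rng = np.random.default_rng(0)
        shards = [rng.standard_normal((d_model, d_model // n_spokes))
                  for _ in range(n_spokes)]      # weights never cross a link

        def spoke_forward(shard, activation):
            return activation @ shard            # runs locally on a spoke

        def hub_forward(activation):
            # hub->spoke: d_model floats; spoke->hub: d_model/n_spokes each
            return np.concatenate([spoke_forward(s, activation) for s in shards])

        x = rng.standard_normal(d_model)
        y = hub_forward(x)                       # matches the unsharded matmul
        assert y.shape == (d_model,)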

    That said, it's 26-29 seconds per token for llama2-70b with their 8 edge devices, each using 4 gigs of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.

    I think the paper makes the case that you could probably recruit, say, your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.

    Upshot: interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects; this may point a way forward on the interconnect-bandwidth part of the story.

    • tgtweak 29 minutes ago

      I think the main advantage here is you COULD run it, even if it takes a while. That is a step up from the current limitation of needing enough RAM or VRAM to hold the model.

      I think this lays some groundwork for running a 400B model on a 3090/4090 or an even smaller GPU. If you can get a huge model like that running on a single GPU, even if the mean time per token is in the seconds, that's acceptable for many use cases.

      If this same technique can be used to extend context windows in addition to token autocomplete, that would be great in its own right.

      Hopefully work like this continues, as throwing a ton of VRAM at a model should be regarded as a performance optimization, not necessarily a requirement.

      • alchemist1e9 26 minutes ago

        > I think the paper makes the case that you could probably recruit, say, your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.

        Could be a big deal if it allows a cluster of smaller GPUs to compete with a single large-VRAM GPU.

        Unfortunately I’m a few months out of date - which is an eternity in LLM inference techniques - so I’m not sure what the current state of distributed inference looks like.

      • loufe 2 hours ago

        It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.

        • woadwarrior01 2 hours ago

          It's not cost-free. It comes at the cost of greatly increased latency: 29.9 seconds per token with Llama 3.1-70B, per Table 1 (p. 8) of the paper.

          • _ache_ an hour ago

            That is s/token and not token/s. The cost is high.

            The actual goal of the article is to highlight that the overall speed can be optimised by decreasing link latency. Yes, link latency, because it's not one machine but several low-end devices that are used together to serve the 70B LLM.

            • m3kw9 an hour ago

              Ah the disk swap method

              • _ache_ an hour ago

                It's not disk swap. It's multi-device LLM inference.

                • thelastparadise an hour ago

                  Is there any predictable pattern to neuron/layer activation? If so, would it be reasonable to have a second, tiny model that specifically tries to predict activations and preemptively swap the corresponding weights into memory?

                  • miki123211 an hour ago

                    This isn't how neural networks work.

                    For vanilla models, you always use all the weights. That isn't true for mixture-of-experts, though, and in that setting, your approach has merit.
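
                    For MoE, a sketch of what that could look like (my toy code with made-up shapes and a stand-in predictor, not any real framework's API): a cheap predictor guesses which experts the next layer's router will pick, and a background thread pulls just those expert weights off disk while the current layer computes.

                        import threading
                        import numpy as np

                        # Hypothetical sketch: guess which experts the next MoE layer
                        # will route to and prefetch only those weights from disk.
                        N_EXPERTS, TOP_K, D = 8, 2, 1024
                        expert_cache = {}   # (layer, expert_id) -> weight matrix

                        def load_expert(layer, expert_id):
                            # Stand-in for a real disk read (np.load / mmap in practice).
                            expert_cache[(layer, expert_id)] = np.zeros((D, D), dtype=np.float32)

                        def predict_experts(hidden):
                            # Toy predictor; a real one might be a small learned copy of
                            # the router. Here we just derive a deterministic guess.
                            seed = int(abs(float(hidden.sum())) * 997) % N_EXPERTS
                            return [(seed + k) % N_EXPERTS for k in range(TOP_K)]

                        hidden = np.random.randn(D).astype(np.float32)
                        for layer in range(4):
                            guesses = predict_experts(hidden)
                            prefetch = threading.Thread(
                                target=lambda: [load_expert(layer + 1, e) for e in guesses],
                                daemon=True)
                            prefetch.start()   # disk reads overlap this layer's math
                            # ... compute MoE layer `layer` with experts already cached ...
                            prefetch.join()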

                    • tcdent an hour ago

                      Depends on the architecture, but generally you just move through the layers linearly. Simple iteration.

                      The number of layers, and the amount of time spent in each of them, make me think any benefit from pre-loading the layer ahead is negligible.

                      You really need the entire model on device to consider it performant.
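
                      Back-of-envelope with made-up numbers: if the model has ~80 layers and pulling one layer's weights off disk takes ~300 ms while computing it takes ~30 ms, perfectly overlapping compute with the next load only hides the 30 ms; you're still at roughly 80 x 300 ms ≈ 24 s per token. Prefetching one layer ahead can't beat the load time, it can only keep the compute from stacking on top of it.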

                • freehorse 2 hours ago

                  From what I get skimming through the article, the main cost is speed of token generation (token latency). You can always run a large model by reading directly from disk and not care much about RAM, but it is very slow. They try to improve that aspect with some optimisations, but it is still definitely slower than using RAM or VRAM.
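
                  Roughly, the naive version of that looks like the sketch below (mine, not the paper's code): load one layer's weights from disk, apply it, free it, move on. Peak RAM is about one layer, but every token pays the full disk read cost.

                      import numpy as np

                      # Naive run-from-disk loop: per-token time ends up roughly
                      # n_layers * (disk read time per layer), however fast the math is.
                      def forward_from_disk(x, layer_paths):
                          for path in layer_paths:      # one saved weight file per layer
                              w = np.load(path)         # blocking disk read, the slow part
                              x = np.maximum(x @ w, 0)  # stand-in for the real layer compute
                              del w                     # keep peak RAM at ~one layer
                          return x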

                  • refulgentis an hour ago

                    Table 3 directly refutes this* and claims 0 tradeoffs.**

                    Below that, they indicate that a key part of the implementation is loading weights from disk before they're needed using a separate thread.***

                    * maybe I'm missing something though, someone please triple check :)

                    ** ttft (time to first token) and s/token (seconds per token) are both lower than any alternative in all cases.

                    *** "daemon thread asynchronously preloads the weights"

                    • sgc 39 minutes ago

                      I want to add that their chart shows s/token per device (edit: as per the heading on Table 1 - it could also just be confusing grammar), so it sounds like you are getting 4x the listed s/token on their 4-laptop cluster. Their laptops are not even hardwired - they are connecting over wifi.

                      This comes at a very interesting time for me. I have an ancient dual-Xeon workstation with 64 GB of memory that I was researching how to convert to run an LLM. To start, I can just run 4 instances on that same machine and see how it goes, without purchasing a better GPU. It sounds like this will allow you to run very large models with minimal quants, on craigslist-quality devices.

                      If it does what they say it does (and it seems to), it will be an absolute game changer for most users.

                  • zackangelo 2 hours ago

                    I've only read the abstract but they don't mention quantizing the weights or otherwise trying to shrink the model in any way.

                    They're claiming to be able to efficiently run larger models without loading the entire thing into GPU memory. If they're using the same weights and the same architecture, and just using tensor-parallel operations to perform the forward pass, that would imply no loss in quality.

                    I'm sure there are trade-offs but they're not clear by just looking at the abstract.

                    • tgtweak 26 minutes ago

                      I read it like this too - no drop in weights or model quality, just optimizing the lower boundaries of performance when you are splitting from VRAM to RAM to disk (or network).

                    • not_a_dane 2 hours ago

                      Nothing is free in this world.

                    • Zetaphor 42 minutes ago

                      Is this different from (or related to) the work being done by the exo project?

                      https://github.com/exo-explore/exo

                      • tgtweak 25 minutes ago

                        Exo is for partitioning a model over the network across devices (implementing some bandwidth-reducing partitions), but it still has a minimum RAM/VRAM requirement to load a model. This could, in theory, be combined with it to allow larger models to run on exo clusters with less GPU/RAM than the underlying model would otherwise require (at the cost of some performance, no doubt, but still).

                      • tgtweak 32 minutes ago

                        Is there a CUDA implementation of this... asking for a friend

                        • dvh an hour ago

                          So when will I be able to "sudo apt-get install llm"?

                          • o11c 34 minutes ago

                            Realistically, you probably want to wait until Vulkan support trickles out. That way, you aren't at the whim of the various evil hardware drivers (everybody's suck), and the AI can give you a disappointingly confused answer much faster than running the LLM on a CPU would.

                            • yjftsjthsd-h an hour ago

                              I'm not aware of any Debian-family distro that packages it, but NixOS has at least ollama and llama-cpp in its repos. Honestly, even if the more stable distributions did have these things packaged, I would hesitate to use the packaged versions, because all of this stuff is still moving so quickly that you'd be on an old version and it would hurt.

                              Edit: Arch has ollama in official repos too. OpenSUSE has https://software.opensuse.org/package/ollama .

                              • mysterhawk an hour ago

                                You can already do it with llamafile. Check out the project; it lets you convert a .gguf model into a portable executable.

                              • paxys an hour ago

                                You already can with ollama

                                • jsanders9 an hour ago

                                  Ollama is close...