If I understand the article correctly, any sufficiently capable attacker can:
- Know the global state of your GPU cluster via the client.
- Target the most struggling GPU instances specifically since the client decides which one to hit.
You offer a free tier, which means anyone can get an account and try this (e.g. keep one "harmless", mostly inactive free account whose only purpose is reading GPU cluster status, plus a bunch of burner accounts to overload the struggling instances).
I may be completely wrong, but this sounds like DDoS served on a silver platter to me.
They run these clients themselves, and the Redis instance isn't publicly exposed.
It would indeed be very strange to hope your random users cooperate with your client-side load balancer. You wouldn't even have to send real traffic: you could just manipulate Redis directly to force all the real traffic onto a single node. DoSing Redis itself is also pretty easy.
I don't think the article implied that the client was for some sort of internal server-to-server communication, or that the Redis instance was directly exposed to the internet.
So no, I don't think they run these clients themselves. If the code runs out there, it's open to inspection.
Either way, you are right to point out that it's important to only try a pattern like this if your clients are highly trusted (and/or there are additional compensating controls against DDoS threats). It would be beneficial if the OP made the client/server relationships more explicit and also flagged the risk you mentioned, so general audiences don't go implementing such a solution in the wrong places.
Author here. We were hitting tail latency and low GPU utilization issues serving SLMs via Triton.
I built a scrappy client-side router using Redis and Lua to track real-time GPU load. It boosted utilization by ~40% and improved latencies.
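Roughly, the core idea looks like this (a simplified sketch, not the production code; the key name, instance addresses, and helper names are placeholders). A Redis sorted set keeps an in-flight request count per GPU instance, and a small Lua script makes "read the least-loaded instance and bump its counter" a single atomic step so concurrent clients don't all pile onto the same idle node:

```python
# Sketch of a client-side least-loaded router over Redis + Lua.
# "gpu:inflight" and the gpu-*:8000 addresses are illustrative placeholders.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Sorted set: member = GPU instance address, score = current in-flight requests.
r.zadd("gpu:inflight", {"gpu-0:8000": 0, "gpu-1:8000": 0, "gpu-2:8000": 0})

# Lua keeps "pick lowest score" and "increment it" atomic on the server side.
PICK_LUA = """
local picked = redis.call('ZRANGE', KEYS[1], 0, 0)[1]
if not picked then return nil end
redis.call('ZINCRBY', KEYS[1], 1, picked)
return picked
"""
pick_least_loaded = r.register_script(PICK_LUA)

def acquire():
    # Returns the least-loaded instance and counts this request against it.
    return pick_least_loaded(keys=["gpu:inflight"])

def release(instance):
    # Call after the response (or a failure) so the load view stays accurate.
    r.zincrby("gpu:inflight", -1, instance)

instance = acquire()
try:
    pass  # send the inference request to `instance` here
finally:
    release(instance)
```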
Happy to hear feedback on the implementation or thoughts on better ways to do this!
Have you tried switching it to a job queue where the GPU instances try to keep themselves busy? That way you can autoscale the GPUs based on utilization. I find it easier to tune, and you can monitor latency and backlogs more easily. It does require some async mechanism on the client side, but I've found it easier to maintain.
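Something roughly like this, since you already have Redis around (a simplified sketch with made-up key names and a stubbed handler, not a drop-in replacement): clients push jobs onto a list, each GPU worker pulls whenever it's free, and an autoscaler watches the backlog instead of the clients tracking per-instance load.

```python
# Sketch of the pull-based alternative: push jobs, GPU workers pull when idle.
# "gpu:jobs", "gpu:results:*", and handle_job are illustrative placeholders.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def submit(job_id, payload):
    # Client side: enqueue and return immediately; results arrive asynchronously.
    r.rpush("gpu:jobs", json.dumps({"id": job_id, "payload": payload}))

def worker_loop(handle_job):
    # GPU side: block until a job is available, process it, repeat.
    while True:
        _, raw = r.blpop("gpu:jobs")          # blocks, so idle workers sit cheaply
        job = json.loads(raw)
        result = handle_job(job["payload"])   # run inference on the local GPU
        r.rpush(f"gpu:results:{job['id']}", json.dumps(result))

def backlog_depth():
    # Autoscaler signal: scale workers up/down on queue length (and job age).
    return r.llen("gpu:jobs")
```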