How to use GPUs over multiple computers for local AI?
The problem is simple: consumer motherboards don't have that many PCIe slots, and consumer CPUs don't have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.
My idea was to buy 3-4 cheap computers, slot a GPU into each of them, and use them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected over a 10GbE network. I can get a 10GbE network running for this project.
Does Ollama or any other local AI project support this? Getting a server motherboard and CPU is going to get expensive very quickly, so this would be a great alternative.
A 10 Gbps network is MUCH slower than even the smallest, oldest PCIe slot you have. So cramming the GPUs into any old slot that'll fit is a much better option than distributing them over multiple PCs.
I agree with the idea of not using a 10 Gbps network for GPU work. Just one small nitpick: a PCIe Gen 1 x1 slot is only capable of 2.5 GT/s, which after 8b/10b encoding works out to about 2 Gbit/s, making it about 5x slower than a 10 Gbps line-rate network.
I sincerely hope OP is not running modern AI work on a mobo with only Gen 1...
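For anyone who wants the napkin math, here's the raw line-rate comparison (a rough sketch; it ignores protocol overhead on both the PCIe and Ethernet side):

```python
# Usable PCIe bandwidth per lane and per x16 slot vs. a 10 GbE link.
# (gigatransfers/s, encoding efficiency) for each PCIe generation.
PCIE_GENS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),  # 128b/130b encoding
    4: (16.0, 128 / 130),
}
ETHERNET_GBPS = 10.0      # 10 GbE line rate

for gen, (gt_per_s, efficiency) in PCIE_GENS.items():
    per_lane = gt_per_s * efficiency   # Gbit/s for an x1 slot
    x16 = per_lane * 16                # Gbit/s for a full x16 slot
    print(f"Gen {gen}: x1 = {per_lane:5.2f} Gbit/s, x16 = {x16:6.1f} Gbit/s "
          f"({x16 / ETHERNET_GBPS:4.1f}x a 10 GbE link)")
```

Even an old Gen 3 x16 slot has roughly 12-13x the bandwidth of a 10 GbE link, so the advice to keep everything in one box still stands.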
Thanks for the comment. I don't want to use a networked distributed cluster for AI if I can help it. I'm looking at other options and maybe I'll find something.
Your point is valid. Originally I was looking for deals on cheap CPU + motherboard combos that would offer a lot of PCIe lanes, but I couldn't find anything good for EPYC. I'm now looking at used Supermicro motherboards and maybe I can get something I like. I didn't want to do networking for this project either, but it was the only idea I could think of a few hours ago.
consumer motherboards don’t have that many PCIe slots
The number of PCIe slots isn't the most limiting factor on consumer motherboards. It's the number of PCIe lanes your CPU provides and the motherboard actually wires up to those slots.
It's difficult to find non-server hardware that can do something like this, because you need a significant number of PCIe lanes from the CPU to feed several GPUs at full speed. Using an M.2 SSD? That eats another four lanes, making it even harder.
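For a ballpark of the lane math (the counts below are illustrative; exact numbers vary by CPU and board):

```python
# Rough lane budget: what several full-speed GPUs want vs. what a
# typical desktop CPU exposes. Numbers are ballpark, not a spec sheet.
CPU_LANES = 24       # e.g. 16 for the GPU slots + 4 for an M.2 + 4 to the chipset
LANES_PER_GPU = 16   # one "full speed" x16 slot per GPU

for gpus in (3, 4):
    wanted = gpus * LANES_PER_GPU
    print(f"{gpus} GPUs at x16 want {wanted} lanes, "
          f"but the CPU only has ~{CPU_LANES} to give")
```

That's why multi-GPU consumer builds end up running slots at x8/x4 or hanging them off the chipset.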
Your 1 GPU per machine idea is a decent approach, and a Kubernetes cluster with device plugins is likely the best way to accomplish what you want here. It would involve setting up your cluster and installing the GPU drivers on each node, which exposes the devices to the system. Then, when you create your Ollama container, make sure the GPUs are requested by and exposed to the container (the NVIDIA container runtime's prestart hook handles the actual device injection).
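If you do go the Kubernetes route, here's a minimal sketch of what the Ollama pod could look like, built with the official Kubernetes Python client. It assumes the NVIDIA device plugin is already running on every node (so `nvidia.com/gpu` shows up as a schedulable resource); the image tag and port are Ollama's public defaults, and everything else is a placeholder:

```python
from kubernetes import client

# One Ollama container per pod, requesting a single GPU on the node.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ollama", labels={"app": "ollama"}),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="ollama",
                image="ollama/ollama:latest",
                ports=[client.V1ContainerPort(container_port=11434)],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one GPU per pod
                ),
            )
        ],
    ),
)

# Print the manifest the cluster would receive; on a live cluster you'd
# call client.CoreV1Api().create_namespaced_pod("default", pod) instead.
print(client.ApiClient().sanitize_for_serialization(pod))
```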
The issue with doing this is that 10GbE is very slow compared to a GPU on PCIe. You'd be networking all these GPUs together to do some cool stuff, but then severely bottlenecking yourself with your network. All in all, it's not a very good plan.
I agree with your assessment. I was indeed going to run k8s, just hadn't figured out what you told me. Thanks for that.
And yes, I realised that 10GbE is just not enough for this stuff. Another commenter told me to look for used Threadripper and EPYC boards (which are extremely expensive for me), which gave me the idea to look for older Intel CPU + motherboard combos. Maybe I'll have some luck there. I was going to use Talos in a VM with all the GPUs passed through to it.
It's a way to do distributed parallel computing using consumer-grade hardware. I don't actually know a ton about it, though, so you'd be better served by looking up more information yourself.
Maybe take a look at systems with the newer AMD SoCs first. They use the system's RAM and come with a proper NPU; once Ollama or mistral.rs support those, they might give you sufficient performance for your needs at way lower cost (including power consumption). Depending on how NPU support gets implemented, it might even become possible to use the NPU and GPU in tandem, which would probably allow pretty powerful models to run on consumer-grade hardware at reasonable speed.
This is false: Mistral Small 24B at q4_K_M quantization is about 15 GB; q8 is about 26 GB. A 3090/4090/5090 with 24 GB, or two cards with 16 GB each (I recommend the 4060 Ti 16GB), will run this model fine, and in a single computer. Like others have said, 10GbE will be a huge bottleneck, and it's simply not necessary to distribute a 24B model across multiple machines.
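Napkin math behind those sizes, if anyone wants to sanity-check them (the effective bits-per-weight figures are approximate, and KV cache / runtime overhead isn't included):

```python
# Weights-only size estimate for a quantized model.
# Effective bits/weight are approximate, since k-quants mix block formats.
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bpw in [("q4_K_M", 4.85), ("q8_0", 8.5)]:
    print(f"Mistral Small 24B @ {name}: ~{quant_size_gb(24, bpw):.1f} GB")
```

Add a couple of GB for context/KV cache and you land right around the figures above, which is why a single 24 GB card is comfortable at q4 but tight at q8.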
I assume you're talking about a CUDA implementation here. There are ways to do this with that system, and even sub-projects that expand on it. I'm mostly pointing out how pointless it is for you to do this. What a waste of time and money.
Edit: others are also pointing this out, but I'm still being downvoted. Mkay.
If you want to use supercomputer software, set up the SLURM scheduler on those machines. There are many tutorials on how to do distributed GPU computing with SLURM. I have it on my todo list. https://github.com/SchedMD/slurm https://slurm.schedmd.com/
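If you ever do try it, the usual pattern is an sbatch script that launches one task per GPU node, plus a worker that reads SLURM's environment to join a process group. A minimal, illustrative PyTorch-side sketch (it assumes the batch script exports MASTER_ADDR / MASTER_PORT and allocates one GPU per task; it's not taken from the linked docs):

```python
import os
import torch
import torch.distributed as dist

# Under `srun`, SLURM sets these for every task it launches.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

# Uses the default env:// rendezvous, so MASTER_ADDR and MASTER_PORT
# must be exported by the sbatch script (e.g. the first allocated node).
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

print(f"rank {rank}/{world_size} ready on {os.environ.get('SLURMD_NODENAME')}")
dist.destroy_process_group()
```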
Thanks, but I'm not trying to run a supercomputer. I just want to run 24B-30B models across 4 GPUs in separate machines because a single computer doesn't have enough PCIe lanes.
I believe you can run 30B models on a single used RTX 3090 24GB; at least, I run the 32B DeepSeek-R1 on one using Ollama. Just make sure you have enough RAM (> 24 GB).
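For reference, talking to it from the official Ollama Python client looks something like this (it assumes `ollama serve` is running locally and the `deepseek-r1:32b` tag has already been pulled):

```python
import ollama  # official Python client for the local Ollama server

# Send a single prompt to the locally served model and print the reply.
response = ollama.chat(
    model="deepseek-r1:32b",
    messages=[{"role": "user", "content": "In one sentence, why do PCIe lanes matter?"}],
)
print(response["message"]["content"])
```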
Sure, that works fine for inference with tensor parallelism. USB4 / Thunderbolt 4/5 is a better bet than Ethernet (40 Gbit+ and already there); see distributed-llama. It's trash for training / fine-tuning though, since that needs much higher inter-GPU bandwidth, or better yet a single GPU with more VRAM.
I know nothing technical to help you. But this guy’s YouTube video goes over random shit about using different computers. I believe he uses thunderbolt 4 to connect the systems, though. Plenty of other material on YouTube, as well.