Fine-grained control of GPU offloading #7678
Replies: 1 comment
You will need to use the -ts / --tensor-split option:

-ts, --tensor-split N0,N1,N2,...
                        fraction of the model to offload to each GPU, comma-separated list of
                        proportions, e.g. 3,1

You may also need to use a few other parameters as well:

-c, --ctx-size N        size of the prompt context (default: 0, 0 = loaded from model)
                        (env: LLAMA_ARG_CTX_SIZE)
-mg, --main-gpu INDEX   the GPU to use for the model (with split-mode = none), or for
                        intermediate results and KV (with split-mode = row) (default: 0)
Consider the following GPUs...

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:01:00.0 Off | Off |
| 30% 44C P8 19W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:2D:00.0 Off | Off |
| 30% 51C P8 20W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:41:00.0 Off | Off |
| 30% 51C P8 21W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:61:00.0 Off | Off |
| 30% 47C P8 16W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

First, let's run Llama 3, 70B Instruct, Q8...

CUDA_VISIBLE_DEVICES=0,1,2,3 ./llama-cli --interactive-first --color -ngl 999 --ctx-size 0 -sm row -m /mnt/data2/models/Meta-Llama-3-70B-Instruct.Q8_0.gguf

Here we are using -sm row to split the model row-wise across all four GPUs, -ngl 999 to offload every layer, and --ctx-size 0 to take the context size from the model.
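For comparison, llama.cpp's default split mode distributes whole layers rather than rows; a hypothetical equivalent run (not from the original reply) would drop -sm row or pass -sm layer explicitly:

# Assumed alternative, not shown in the original post: layer-wise splitting instead of row-wise.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./llama-cli --interactive-first --color -ngl 999 --ctx-size 0 -sm layer -m /mnt/data2/models/Meta-Llama-3-70B-Instruct.Q8_0.gguf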
When the model is loaded you can see that GPU 0 is in fact using more memory, as the context is located on that GPU.

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 7974 C ./llama-cli 22076MiB |
| 1 N/A N/A 7974 C ./llama-cli 18394MiB |
| 2 N/A N/A 7974 C ./llama-cli 18392MiB |
| 3 N/A N/A 7974 C ./llama-cli 18394MiB |
+-----------------------------------------------------------------------------------------+

As the context will always be placed on the main GPU, you will want to use -mg / --main-gpu to control which GPU that is.
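The invocation that produced the output below isn't shown in the original reply; presumably it was the same command with the main GPU switched to GPU 1, along the lines of:

# Assumed invocation (not in the original reply): same as before, with --main-gpu 1 added.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./llama-cli --interactive-first --color -ngl 999 --ctx-size 0 -sm row --main-gpu 1 -m /mnt/data2/models/Meta-Llama-3-70B-Instruct.Q8_0.gguf

With that change the extra memory use moves from GPU 0 to GPU 1: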
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 9498 C ./llama-cli 18392MiB |
| 1 N/A N/A 9498 C ./llama-cli 22078MiB |
| 2 N/A N/A 9498 C ./llama-cli 18392MiB |
| 3 N/A N/A 9498 C ./llama-cli 18394MiB |
+-----------------------------------------------------------------------------------------+

Next, let's try Llama 3.1, 70B Instruct, Q8...

CUDA_VISIBLE_DEVICES=0,1,2,3 ./llama-cli --interactive-first --color -ngl 999 --ctx-size 0 -sm row -m /mnt/data2/models/Meta-Llama-3.1-70B-Instruct.Q8_0.gguf

As Llama 3.1 has a context size of 131072 tokens, loading it with --ctx-size 0 tries to allocate the entire KV cache on the main GPU and fails:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 40960.00 MiB on device 0: cudaMalloc failed: out of memory
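That 40960 MiB allocation is consistent with an f16 KV cache for the full 131072-token context. As a rough sanity check (my own arithmetic, not from the original reply, assuming the usual Llama 3.1 70B shape of 80 layers, 8 KV heads and a head dimension of 128):

# Bytes per token = 2 (K and V) * 80 layers * 8 KV heads * 128 head dim * 2 bytes (f16) = 327680
echo $(( 2*80*8*128*2 * 131072 / 1024 / 1024 ))   # -> 40960 MiB for the full context
echo $(( 2*80*8*128*2 * 28672  / 1024 / 1024 ))   # ->  8960 MiB for a 28672-token context

Shrinking the context to 28672 tokens therefore frees roughly 32000 MiB on the main GPU, which is why the run below fits.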
With some experimentation I used...

CUDA_VISIBLE_DEVICES=0,1,2,3 ./llama-cli --interactive-first --color -ngl 999 -n 8192 -sm row -m /mnt/data2/models/Meta-Llama-3.1-70B-Instruct.Q8_0.gguf --main-gpu 0 --ctx-size 28672 --tensor-split 1,2,2,2

Here --tensor-split 1,2,2,2 gives GPU 0 a smaller share of the weights, leaving room for the context that it also has to hold as the main GPU. The breakdown of memory use across the GPUs is...

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 11374 C ./llama-cli 23120MiB |
| 1 N/A N/A 11374 C ./llama-cli 20670MiB |
| 2 N/A N/A 11374 C ./llama-cli 20670MiB |
| 3 N/A N/A 11374 C ./llama-cli 21312MiB |
+-----------------------------------------------------------------------------------------+

Hopefully this is helpful.
Is there a way to control exactly how many layers of a model get offloaded to each GPU in a workstation with multiple GPUs?
Right now I have a workstation with 3 GPUs:
I set CUDA_VISIBLE_DEVICES="2,0,1" (listed in order of descending PCIe bandwidth).
I have monitors connected to two of the GPUs, so as you can see the 3090s have an uneven amount of available VRAM. As far as I can tell, I can only specify a total number of layers to offload; llama.cpp then looks at the maximum VRAM available on each GPU and divides the layers up accordingly. Because the GPUs don't have an even amount of available VRAM, I wind up with a situation that looks like this:
There are some edge cases where llama.cpp's default layer allocation forces me to take a more aggressive quant of a model to avoid malloc errors on the GPU with the least available VRAM, compared to what I think I could manage if I could control exactly how many layers (and in what GPU order) get offloaded.
So it would be useful if I could specify exactly the number of layers offloaded to each GPU to fine-tune performance on my workstation. Is there any way to do this with the current codebase?