Hello, I'm trying to perform LLM inference using various combinations of CPUs and/or AMD GPUs. Currently I can generate output from a prompt with llama.cpp on the CPU (though without controlling how many CPUs are used), but I'm running into a lot of issues when attempting to offload to the GPUs.

**What works**

Following the README, I built llama.cpp without the GPU offload feature, and CPU inference runs correctly.

**What doesn't work**

Compiling with GPU offload enabled (along the lines of the HIP build sketched below) and then running with layers offloaded to the GPUs.

**Errors**

These attempts at offloading to the GPU give me either `Segmentation fault (core dumped)` or `CUDA error: out of memory`, along with:

> Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope.

Example of the segmentation fault after running `export` …

Any help or guidance would be greatly appreciated!
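For concreteness, the HIP build described in the llama.cpp README looks roughly like this; this is a sketch rather than my exact command, the `gfx1030` target is just the README's example value, and older checkouts use `-DLLAMA_HIPBLAS=ON` instead of `-DGGML_HIP=ON`:

```bash
# ROCm/HIP build along the lines of the llama.cpp README (sketch; the
# AMDGPU_TARGETS value must match your GPU -- gfx1030 is only the README's example):
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
```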
---
You are using the wrong target; the MI50 is `gfx906`.
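A quick way to check which target your cards actually are, assuming `rocminfo` from your ROCm install is on the PATH:

```bash
# List the gfx targets ROCm detects; an MI50 should report gfx906:
rocminfo | grep -i 'gfx'
# Then rebuild llama.cpp with that target, e.g. -DAMDGPU_TARGETS=gfx906
```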
I can run any llama.cpp-supported model on a single GPU, multiple GPUs, or GPU with partial CPU offload, without any issue.
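For what it's worth, a minimal sketch of how I pick devices for those runs; `HIP_VISIBLE_DEVICES` is the standard ROCm device selector, the binary name depends on your llama.cpp version (`main` in older trees, `llama-cli` in newer ones), and the model path is a placeholder:

```bash
# Single GPU: expose only device 0 (indices as shown by rocm-smi):
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m ./models/model.gguf -ngl 99

# Multi GPU: expose several devices and let llama.cpp split the layers:
HIP_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-cli -m ./models/model.gguf -ngl 99

# Partial CPU offload: only offload some of the layers to the GPU(s):
./build/bin/llama-cli -m ./models/model.gguf -ngl 20
```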
In my system I have 3x Radeon Pro VIIs and a single MI25. I had to add `export HSA_ENABLE_SDMA=0` to `.bashrc` to get the MI25 working with ROCm 6+, but that shouldn't be necessary for the MI50s. I installed ROCm with `amdgpu-install`, no Docker, and llama.cpp works fine, though I'm on a much newer kernel.

Are you still getting segmentation faults, or is it just `CUDA error: out of memory`?

By default llama.cpp puts the context on the first GPU but spreads the layers across the available GPUs, and your model has a 32k context size; it is possible that you are out of memory on a…
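If it is memory, these are the standard llama.cpp knobs I'd try first (the binary name and the values here are only illustrative):

```bash
# Shrink the context instead of using the model's 32k default (-c),
# offload fewer layers (-ngl), set the split ratio across GPUs (-ts),
# and choose which GPU holds the context buffers (-mg):
./build/bin/llama-cli -m ./models/model.gguf -c 4096 -ngl 32 -ts 1,1,1 -mg 0
```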