
Major FP32 llm.c improvements/refactoring/etc. #696

Open
wants to merge 4 commits into master
Conversation

ademeure
Contributor

I got slightly carried away, and this ended up significantly changing nearly every kernel in train_gpt2_fp32.cu! I have also added a lot of comments to the kernels - possibly too many, but if the FP32 version is meant to be strictly educational, then hopefully at least some of them are useful.

The biggest difference is that several kernels now use a new "row_reduction()" function to abstract away some of the computation in a way that makes intuitive sense to me, while remaining far less general than something like PyTorch, which supports reductions over any axis, in any way, etc. The row_reduction code itself is actually pretty tiny, and personally I feel it makes what is going on more obvious rather than less. It does require a little "more C++", but nothing too crazy (in my opinion). The kernels that don't use row_reduction() are also significantly simplified.
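For context, a minimal sketch of what a row-wise reduction helper of this shape could look like (the name row_reduction is taken from the description above, but the signature and body here are purely illustrative and not the code in this PR):

```cuda
// Illustrative only, NOT the actual helper in this PR: each warp reduces one row
// of a (rows x cols) matrix, applying a per-element functor before summing via
// warp shuffle instructions.
template <typename Func>
__device__ float row_reduction(const float* row, int cols, Func f) {
    int lane = threadIdx.x % 32;
    // each lane accumulates a strided partial sum over the row
    float partial = 0.0f;
    for (int i = lane; i < cols; i += 32) {
        partial += f(row[i]);
    }
    // tree reduction across the warp
    for (int offset = 16; offset > 0; offset /= 2) {
        partial += __shfl_down_sync(0xFFFFFFFFu, partial, offset);
    }
    // lane 0 now holds the full row sum; broadcast it to every lane
    return __shfl_sync(0xFFFFFFFFu, partial, 0);
}

struct Identity { __device__ float operator()(float x) const { return x; } };

// example use: per-row mean, the first step of a layernorm-style kernel
__global__ void row_mean_kernel(float* out, const float* inp, int rows, int cols) {
    int row = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;  // one warp per row
    if (row >= rows) return;
    float sum = row_reduction(inp + (size_t)row * cols, cols, Identity());
    if (threadIdx.x % 32 == 0) out[row] = sum / cols;
}
```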

Another big change is that I've added a TF32 matrix multiplication kernel based on NVIDIA's CUDA sample at https://github.com/NVIDIA/cuda-samples/tree/v12.4/Samples/3_CUDA_Features/tf32TensorCoreGemm. Unfortunately, it is still much slower than cuBLAS (80K T/s for cuBLAS, 50K T/s for TF32, 40K T/s for FP32), and that is with only the forward matmul kernel replaced; performance would be far worse if all matrix multiplies used our custom matmuls. So I decided to add back cuBLAS (non-Lt, hence non-fused bias) as the new default, while keeping both the current default FP32 kernel and the new TF32 forward matmul kernel available on the command line ("-c 1" selects the new TF32 kernel).
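For reference, the core of a TF32 tensor-core tile in the style of that NVIDIA sample looks roughly like this (a simplified single-tile sketch, not the kernel added in this PR; the parameter names, layouts, and lack of shared-memory staging are all simplifications):

```cuda
#include <mma.h>
using namespace nvcuda;

// Simplified sketch: one warp computes a single 16x16 output tile with TF32 MMA.
// A real kernel (like the NVIDIA sample) tiles over the whole matrix and stages
// data through shared memory.
__global__ void tf32_tile_gemm(float* C, const float* A, const float* B,
                               int K, int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 8) {
        wmma::load_matrix_sync(a_frag, A + k, lda);
        wmma::load_matrix_sync(b_frag, B + k, ldb);
        // explicitly round the FP32 inputs to TF32 (10-bit mantissa) before the MMA
        for (int i = 0; i < a_frag.num_elements; i++) a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
        for (int i = 0; i < b_frag.num_elements; i++) b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
    }
    wmma::store_matrix_sync(C, acc, ldc, wmma::mem_row_major);
}
```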

In terms of changes that might be useful for the non-FP32 version: I think the backward bias kernel in the current llm.c master is overcomplicated, and the approach I came up with for this new FP32 version might even be faster, so I'm planning to modify it to work with train_gpt2.cu later.
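For reference, the bias gradient is just a column-wise sum of dout over all B*T positions; a naive version of that reduction (purely illustrative, not necessarily the approach taken in this PR or in master) is:

```cuda
// Naive illustration of what bias backward computes, NOT this PR's kernel:
// dbias[c] += sum over all B*T rows of dout[row, c].
// One thread per output channel; consecutive threads read consecutive channels,
// so loads stay coalesced as each thread walks down the rows.
__global__ void naive_backward_bias(float* dbias, const float* dout, int BT, int OC) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= OC) return;
    float sum = 0.0f;
    for (int row = 0; row < BT; row++) {
        sum += dout[(size_t)row * OC + c];
    }
    dbias[c] += sum;  // accumulate into the existing gradient buffer
}
```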
