Major FP32 llm.c improvements/refactoring/etc. #696
I got slightly carried away and this ended up significantly changing nearly every kernel in train_gpt2_fp32.cu! I have also added a lot of comments to the kernels - possibly too many, but if the FP32 version is meant to be strictly educational, then hopefully at least some of them are useful.
The biggest difference is that several kernels now use a new "row_reduction()" function to abstract away part of the computation in a way that makes intuitive sense to me, while remaining much less general than something like PyTorch, which supports reductions over any axis. The row_reduction code itself is actually pretty tiny, and personally I feel it makes what's going on more obvious rather than less. It does require a little "more C++" than before, but nothing too crazy (in my opinion). The kernels that don't use row_reduction() have also been significantly simplified.
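To give a feel for the idea (this is an illustrative sketch, not the PR's exact code - the name `row_reduction`, its signature, and the helper kernel here are my own approximation):

```cuda
#include <cuda_runtime.h>

// Sketch: each block reduces one row of length C with a binary functor.
// Compile with --extended-lambda if passing device lambdas as Op.
template <typename Op>
__device__ float row_reduction(const float* row, int C, Op op, float init) {
    // every thread accumulates a strided slice of the row
    float acc = init;
    for (int i = threadIdx.x; i < C; i += blockDim.x) {
        acc = op(acc, row[i]);
    }
    // warp-level tree reduction via shuffles
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        acc = op(acc, __shfl_xor_sync(0xffffffffu, acc, offset));
    }
    // a full block-wide version would also combine warps via shared memory
    return acc;
}

// example use: the mean of one row, as needed by e.g. layernorm forward
__global__ void row_mean_kernel(float* out, const float* inp, int C) {
    const float* row = inp + blockIdx.x * C;
    float sum = row_reduction(row, C,
                              [] __device__ (float a, float b) { return a + b; },
                              0.0f);
    if (threadIdx.x == 0) out[blockIdx.x] = sum / C;
}
```

The appeal is that the per-kernel code shrinks to "what slice of the row do I read, and what do I do with the reduced value", while the reduction mechanics live in one place.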
Another big change is that I've added a TF32 matrix multiplication kernel based on NVIDIA's CUDA sample at https://github.com/NVIDIA/cuda-samples/tree/v12.4/Samples/3_CUDA_Features/tf32TensorCoreGemm - unfortunately, it's still MUCH slower than cuBLAS (80K tokens/s for cuBLAS vs. 50K for TF32 and 40K for FP32, and that's only replacing the forward matmul kernel; performance would be far worse if all matrix multiplies used our custom matmuls). So I decided to add cuBLAS back (non-Lt, so no fused bias) as the new default, while keeping both the existing FP32 and the new TF32 forward matmul kernels available on the command line ("-c 1" selects the new TF32 kernel).
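For anyone unfamiliar with the NVIDIA sample, the core tensor-core pattern looks roughly like this (a minimal un-tiled sketch; the actual kernel in the PR adds shared-memory staging and tiling on top, and assumes here that M, N are multiples of 16 and K of 8):

```cuda
#include <mma.h>
using namespace nvcuda;

// one warp computes one 16x16 tile of C = A @ B, accumulating in FP32
__global__ void tf32_gemm_sketch(float* C, const float* A, const float* B,
                                 int M, int N, int K) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int tiles_n = N / 16;
    int tile_m  = (warp_id / tiles_n) * 16;
    int tile_n  = (warp_id % tiles_n) * 16;
    if (tile_m >= M) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 8) {
        wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);
        // round the FP32 inputs to TF32 (10-bit mantissa) before the MMA
        for (int i = 0; i < a_frag.num_elements; i++)
            a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
        for (int i = 0; i < b_frag.num_elements; i++)
            b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N, wmma::mem_row_major);
}
```

The TF32 rounding step is why the math-mode results differ very slightly from pure FP32 while being much faster on Ampere+ tensor cores.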
In terms of changes that might be useful beyond the FP32 version: I think the backward bias kernel in the current llm.c master is overcomplicated, and the approach I came up with for this new FP32 version might even be faster, so I'm planning to adapt it to train_gpt2.cu later.
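For context, the backward bias gradient is just a column sum of dout over all B*T positions, so a very simple kernel already gets fully coalesced loads (this is a minimal illustration of that idea, not the PR's exact kernel):

```cuda
#include <cuda_runtime.h>

// dbias[oc] = sum over all (b,t) of dout[b,t,oc].
// One thread per output channel: consecutive threads read consecutive
// channels of each row, so every global load is coalesced, and no
// cross-thread reduction is needed at all.
__global__ void backward_bias_sketch(float* dbias, const float* dout,
                                     int B, int T, int OC) {
    int oc = blockIdx.x * blockDim.x + threadIdx.x;
    if (oc >= OC) return;
    float sum = 0.0f;
    for (int i = 0; i < B * T; i++) {
        sum += dout[i * OC + oc];
    }
    dbias[oc] = sum;
}
```

Whether this beats the multi-stage reduction in master depends on occupancy and how large B*T is relative to OC, which is why I want to benchmark it against train_gpt2.cu before porting it.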