Major FP32 llm.c improvements/refactoring/etc. #696
I got slightly carried away and this ended up significantly changing nearly every kernel in train_gpt2_fp32.cu! I have also added a lot of comments to the kernels - possibly too many, but if the FP32 version is meant to be strictly educational, then hopefully at least some of them are useful.
The biggest difference is that several kernels now use a new "row_reduction()" function to abstract away part of the computation in a way that makes intuitive sense to me, while remaining much less general than something like PyTorch, which supports reductions over any axis. The row_reduction code itself is actually pretty tiny, and personally I feel it makes what's going on more obvious rather than less. It does require a little "more C++" than before, but nothing too crazy (in my opinion). The kernels that don't use row_reduction() have also been significantly simplified.
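To give a feel for the idea (this is an illustrative sketch, not the PR's exact code - the name `row_reduction`, its signature, and the helper kernel here are my own approximation):

```cuda
#include <cuda_runtime.h>

// Sketch: each block reduces one row of length C with a binary functor.
// Compile with --extended-lambda if passing device lambdas as Op.
template <typename Op>
__device__ float row_reduction(const float* row, int C, Op op, float init) {
    // every thread accumulates a strided slice of the row
    float acc = init;
    for (int i = threadIdx.x; i < C; i += blockDim.x) {
        acc = op(acc, row[i]);
    }
    // warp-level tree reduction via shuffles
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        acc = op(acc, __shfl_xor_sync(0xffffffffu, acc, offset));
    }
    // a full block-wide version would also combine warps via shared memory
    return acc;
}

// example use: the mean of one row, as needed by e.g. layernorm forward
__global__ void row_mean_kernel(float* out, const float* inp, int C) {
    const float* row = inp + blockIdx.x * C;
    float sum = row_reduction(row, C,
                              [] __device__ (float a, float b) { return a + b; },
                              0.0f);
    if (threadIdx.x == 0) out[blockIdx.x] = sum / C;
}
```

The appeal is that the per-kernel code shrinks to "what slice of the row do I read, and what do I do with the reduced value", while the reduction mechanics live in one place.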
Another big change is that I've added a TF32 matrix multiplication kernel based on NVIDIA's CUDA sample at https://github.com/NVIDIA/cuda-samples/tree/v12.4/Samples/3_CUDA_Features/tf32TensorCoreGemm - unfortunately, it's still MUCH slower than cuBLAS (80K tokens/s for cuBLAS vs. 50K for TF32 and 40K for FP32, and that's only replacing the forward matmul kernel; performance would be far worse if all matrix multiplies used our custom matmuls). So I decided to add cuBLAS back (non-Lt, so no fused bias) as the new default, while keeping both the existing FP32 and the new TF32 forward matmul kernels available on the command line ("-c 1" selects the new TF32 kernel).
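For anyone unfamiliar with the NVIDIA sample, the core tensor-core pattern looks roughly like this (a minimal un-tiled sketch; the actual kernel in the PR adds shared-memory staging and tiling on top, and assumes here that M, N are multiples of 16 and K of 8):

```cuda
#include <mma.h>
using namespace nvcuda;

// one warp computes one 16x16 tile of C = A @ B, accumulating in FP32
__global__ void tf32_gemm_sketch(float* C, const float* A, const float* B,
                                 int M, int N, int K) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int tiles_n = N / 16;
    int tile_m  = (warp_id / tiles_n) * 16;
    int tile_n  = (warp_id % tiles_n) * 16;
    if (tile_m >= M) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 8) {
        wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);
        // round the FP32 inputs to TF32 (10-bit mantissa) before the MMA
        for (int i = 0; i < a_frag.num_elements; i++)
            a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
        for (int i = 0; i < b_frag.num_elements; i++)
            b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N, wmma::mem_row_major);
}
```

The TF32 rounding step is why the math-mode results differ very slightly from pure FP32 while being much faster on Ampere+ tensor cores.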
In terms of changes that might be useful beyond the FP32 version: I think the backward bias kernel in the current llm.c master is overcomplicated, and the approach I came up with for this new FP32 version might even be faster, so I'm planning to adapt it to train_gpt2.cu later.
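For context, the backward bias gradient is just a column sum of dout over all B*T positions, so a very simple kernel already gets fully coalesced loads (this is a minimal illustration of that idea, not the PR's exact kernel):

```cuda
#include <cuda_runtime.h>

// dbias[oc] = sum over all (b,t) of dout[b,t,oc].
// One thread per output channel: consecutive threads read consecutive
// channels of each row, so every global load is coalesced, and no
// cross-thread reduction is needed at all.
__global__ void backward_bias_sketch(float* dbias, const float* dout,
                                     int B, int T, int OC) {
    int oc = blockIdx.x * blockDim.x + threadIdx.x;
    if (oc >= OC) return;
    float sum = 0.0f;
    for (int i = 0; i < B * T; i++) {
        sum += dout[i * OC + oc];
    }
    dbias[oc] = sum;
}
```

Whether this beats the multi-stage reduction in master depends on occupancy and how large B*T is relative to OC, which is why I want to benchmark it against train_gpt2.cu before porting it.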