Micro optimization for softmax_forward_kernel5 #762

Open
insop wants to merge 4 commits into master

Conversation

@insop commented on Sep 20, 2024

This branch includes a micro-optimization for softmax_forward_kernel5.

Summary

  • update warpReduceMax in attention_forward.cu to use __shfl_xor_sync instead of __shfl_down_sync, consistent with the other kernels (the max is reduced to all threads in a warp); see the sketch after this list

  • micro optimization for softmax_forward_kernel5

    • Compared to the original code, this optimization yields the following improvements (left: original code, right: modified code):

      • Duration: 1.47 ms -> 1.38 ms
      • Compute (SM) [%]: 77.11% -> 78.68%
      • Memory [%]: 45.03% -> 54.15%
  • tests done:

    • ./profile_gpt2cu
    • ./attention_forward 4
    • ./attention_forward 5
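
For reference, here is a minimal sketch of the warp-level max reduction change described above. The function name warpReduceMax matches the PR, but the body below is only an illustrative butterfly reduction under that assumption, not the actual diff:

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Warp-wide max using the butterfly (__shfl_xor_sync) pattern: after the loop,
// every lane holds the maximum. With __shfl_down_sync only lane 0 would hold
// the final value and a separate broadcast would be needed.
__device__ __forceinline__ float warpReduceMax(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFFu, val, offset));
    }
    return val;
}

// Illustrative use (not from the PR): one warp of 32 threads reduces one row of length N.
__global__ void row_max_demo(const float* x, float* out, int N) {
    float m = -FLT_MAX;
    for (int i = threadIdx.x; i < N; i += 32) {
        m = fmaxf(m, x[blockIdx.x * N + i]);
    }
    m = warpReduceMax(m);                      // every lane now has the row max
    if (threadIdx.x == 0) out[blockIdx.x] = m;
}
```

The xor variant avoids an extra broadcast shuffle when downstream code (e.g. the softmax normalization) needs the maximum in every lane.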

Output from the modified code

  • NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu

  softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:45:01, Context 1, Stream 16
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1.21
    SM Frequency                                                             cycle/usecond                         929.76
    Elapsed Cycles                                                                   cycle                        1283575
    Memory [%]                                                                           %                          54.15
    DRAM Throughput                                                                      %                          47.91
    Duration                                                                       msecond                           1.38
    L1/TEX Cache Throughput                                                              %                          54.50
    L2 Cache Throughput                                                                  %                          51.48
    SM Active Cycles                                                                 cycle                     1275362.68
    Compute (SM) [%]                                                                     %                          78.68
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the   
          compute pipelines are spending their time doing. Also, consider whether any computation is redundant and      
          could be reduced or moved to look-up tables.                                                                  

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        256
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       36864
    Registers Per Thread                                                   register/thread                             32
    Shared Memory Configuration Size                                                 Kbyte                          32.77
    Driver Shared Memory Per Block                                             Kbyte/block                           1.02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                        9437184
    Waves Per SM                                                                                                    82.29
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              8
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                              8
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          92.61
    Achieved Active Warps Per SM                                                      warp                          59.27
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   This kernel's theoretical occupancy is not impacted by any block limit.                                       

Output from the original code

  • NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu

  softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:49:03, Context 1, Stream 16
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1.21
    SM Frequency                                                             cycle/usecond                         928.26
    Elapsed Cycles                                                                   cycle                        1366538
    Memory [%]                                                                           %                          45.03
    DRAM Throughput                                                                      %                          45.03
    Duration                                                                       msecond                           1.47
    L1/TEX Cache Throughput                                                              %                          33.10
    L2 Cache Throughput                                                                  %                          48.18
    SM Active Cycles                                                                 cycle                     1358789.59
    Compute (SM) [%]                                                                     %                          77.11
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the   
          compute pipelines are spending their time doing. Also, consider whether any computation is redundant and      
          could be reduced or moved to look-up tables.                                                                  

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        256
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       36864
    Registers Per Thread                                                   register/thread                             32
    Shared Memory Configuration Size                                                 Kbyte                          32.77
    Driver Shared Memory Per Block                                             Kbyte/block                           1.02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                        9437184
    Waves Per SM                                                                                                    82.29
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              8
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                              8
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          93.33
    Achieved Active Warps Per SM                                                      warp                          59.73
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   This kernel's theoretical occupancy is not impacted by any block limit.                                       

Output from ./attention_forward

nvcc -O3 --use_fast_math -lcublas -lcublasLt attention_forward.cu -o attention_forward
  • testing softmax_forward_kernel4
# ./attention_forward 4
enable_tf32: 1
Using kernel 4
Checking block size 32.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 64.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 128.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 256.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 512.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
All results match. Starting benchmarks.

block_size   32 | time 2.794404 ms
block_size   64 | time 2.136679 ms
block_size  128 | time 2.125906 ms
block_size  256 | time 2.128598 ms
block_size  512 | time 2.151445 ms

  • testing softmax_forward_kernel5

# ./attention_forward 5
enable_tf32: 1
Using kernel 5
Checking block size 32.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 64.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 128.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 256.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 512.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
All results match. Starting benchmarks.

block_size   32 | time 2.016379 ms
block_size   64 | time 1.455155 ms
block_size  128 | time 1.452482 ms
block_size  256 | time 1.450271 ms
block_size  512 | time 1.454224 ms

insop song and others added 4 commits on September 16, 2024 at 16:00
- Micro-optimize softmax_forward_kernel5
- Use __shfl_xor_sync in warpReduceMax so that all threads return the max