Micro optimization for softmax_forward_kernel5 #762

Open
insop wants to merge 4 commits into master

Conversation

@insop commented on Sep 20, 2024

This branch includes a micro-optimization for softmax_forward_kernel5.

Summary

  • update warpReduceMax in attention_forward.cu to use __shfl_xor_sync instead of __shfl_down_sync, consistent with the other kernels (the max is reduced to all threads in a warp); see the sketch after this list

  • micro optimization for softmax_forward_kernel5

    • Compared to the original code, this optimization yields the following improvements (left: original code, right: modified code):

      • Duration: 1.47 ms -> 1.38 ms
      • Compute (SM) [%]: 77.11% -> 78.68%
      • Memory [%]: 45.03% -> 54.15%
  • tests done:

    • ./profile_gpt2cu
    • ./attention_forward 4
    • ./attention_forward 5
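
For reference, here is a minimal sketch of the warp-level max reduction change described above. The function name warpReduceMax matches the PR, but the body below is only an illustrative butterfly reduction under that assumption, not the actual diff:

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Warp-wide max using the butterfly (__shfl_xor_sync) pattern: after the loop,
// every lane holds the maximum. With __shfl_down_sync only lane 0 would hold
// the final value and a separate broadcast would be needed.
__device__ __forceinline__ float warpReduceMax(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFFu, val, offset));
    }
    return val;
}

// Illustrative use (not from the PR): one warp of 32 threads reduces one row of length N.
__global__ void row_max_demo(const float* x, float* out, int N) {
    float m = -FLT_MAX;
    for (int i = threadIdx.x; i < N; i += 32) {
        m = fmaxf(m, x[blockIdx.x * N + i]);
    }
    m = warpReduceMax(m);                      // every lane now has the row max
    if (threadIdx.x == 0) out[blockIdx.x] = m;
}
```

The xor variant avoids an extra broadcast shuffle when downstream code (e.g. the softmax normalization) needs the maximum in every lane.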

Output from the modified code

  • NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu

  softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:45:01, Context 1, Stream 16
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1.21
    SM Frequency                                                             cycle/usecond                         929.76
    Elapsed Cycles                                                                   cycle                        1283575
    Memory [%]                                                                           %                          54.15
    DRAM Throughput                                                                      %                          47.91
    Duration                                                                       msecond                           1.38
    L1/TEX Cache Throughput                                                              %                          54.50
    L2 Cache Throughput                                                                  %                          51.48
    SM Active Cycles                                                                 cycle                     1275362.68
    Compute (SM) [%]                                                                     %                          78.68
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the   
          compute pipelines are spending their time doing. Also, consider whether any computation is redundant and      
          could be reduced or moved to look-up tables.                                                                  

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        256
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       36864
    Registers Per Thread                                                   register/thread                             32
    Shared Memory Configuration Size                                                 Kbyte                          32.77
    Driver Shared Memory Per Block                                             Kbyte/block                           1.02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                        9437184
    Waves Per SM                                                                                                    82.29
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              8
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                              8
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          92.61
    Achieved Active Warps Per SM                                                      warp                          59.27
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   This kernel's theoretical occupancy is not impacted by any block limit.                                       

Output from the original code

  • NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu

  softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:49:03, Context 1, Stream 16
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1.21
    SM Frequency                                                             cycle/usecond                         928.26
    Elapsed Cycles                                                                   cycle                        1366538
    Memory [%]                                                                           %                          45.03
    DRAM Throughput                                                                      %                          45.03
    Duration                                                                       msecond                           1.47
    L1/TEX Cache Throughput                                                              %                          33.10
    L2 Cache Throughput                                                                  %                          48.18
    SM Active Cycles                                                                 cycle                     1358789.59
    Compute (SM) [%]                                                                     %                          77.11
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis section to see what the   
          compute pipelines are spending their time doing. Also, consider whether any computation is redundant and      
          could be reduced or moved to look-up tables.                                                                  

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        256
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       36864
    Registers Per Thread                                                   register/thread                             32
    Shared Memory Configuration Size                                                 Kbyte                          32.77
    Driver Shared Memory Per Block                                             Kbyte/block                           1.02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                        9437184
    Waves Per SM                                                                                                    82.29
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              8
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                              8
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          93.33
    Achieved Active Warps Per SM                                                      warp                          59.73
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   This kernel's theoretical occupancy is not impacted by any block limit.                                       

Output from ./attention_forward

nvcc -O3 --use_fast_math -lcublas -lcublasLt attention_forward.cu -o attention_forward
  • testing softmax_forward_kernel4
# ./attention_forward 4
enable_tf32: 1
Using kernel 4
Checking block size 32.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 64.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 128.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 256.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 512.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
All results match. Starting benchmarks.

block_size   32 | time 2.794404 ms
block_size   64 | time 2.136679 ms
block_size  128 | time 2.125906 ms
block_size  256 | time 2.128598 ms
block_size  512 | time 2.151445 ms

  • testing softmax_forward_kernel5

# ./attention_forward 5
enable_tf32: 1
Using kernel 5
Checking block size 32.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 64.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 128.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 256.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 512.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
All results match. Starting benchmarks.

block_size   32 | time 2.016379 ms
block_size   64 | time 1.455155 ms
block_size  128 | time 1.452482 ms
block_size  256 | time 1.450271 ms
block_size  512 | time 1.454224 ms

insop song and others added 4 commits on September 16, 2024 at 16:00
- Micro-optimize softmax_forward_kernel5
- Use __shfl_xor_sync in warpReduceMax so that all threads return the max