Add RoPE positional encoding - llama3 feature branch #756

gordicaleksa · 2024-09-13T20:28:23Z

Implemented RoPE - rotary position embedding from the RoFormer paper.
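For readers unfamiliar with the technique, here is a minimal pure-Python sketch of the rotation RoPE applies to a single head vector. It is not the CUDA kernel from this PR: the function names `rope_freqs` and `apply_rope`, the interleaved pairing `(x[2i], x[2i+1])`, and the default `base=10000.0` (the RoFormer paper's value) are assumptions made for illustration, and the actual kernel may lay out pairs and the frequency table differently.

```python
import math

def rope_freqs(head_dim, base=10000.0):
    """Per-pair frequencies theta_i = base^(-2i/head_dim), one per 2D pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i.

    Assumption: interleaved pairing, as in the RoFormer formulation.
    """
    head_dim = len(x)
    assert head_dim % 2 == 0, "head dimension must be even"
    out = [0.0] * head_dim
    for i, theta in enumerate(rope_freqs(head_dim, base)):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def dot(u, v):
    """Plain dot product, used to check the relative-position property."""
    return sum(a * b for a, b in zip(u, v))
```

Because each pair is rotated by an angle proportional to the position, the dot product of a rotated query and a rotated key depends only on the difference between their positions, which is what makes the encoding relative.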

Note:

I do not conditionally remove the allocation of our learnable position embedding buffer (wpe) as that would require touching many parts of the codebase that rely on the particular order inside the parameter buffer (e.g. wpe has index 1).
I do turn off fwd/bwd computation / grad norm computation / update for the wpe buffer.
The explicit tradeoff: we accept a small memory overhead (maxT * C elements for the unused wpe buffer), and in return the PR has minimal impact on the readability of the codebase.
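To put the maxT * C figure in perspective, here is a rough back-of-the-envelope sketch. The helper name `wpe_overhead` is hypothetical. The values maxT = 1024 and C = 768 are the standard GPT-2 124M dimensions, and 2 bytes per parameter assumes bf16 storage. The sketch counts only the parameter tensor itself; any gradient or optimizer-state buffers allocated alongside it would add to this.

```python
def wpe_overhead(max_t, channels, bytes_per_param):
    """Return (element count, size in bytes) of an unused wpe buffer."""
    n_params = max_t * channels
    return n_params, n_params * bytes_per_param

# Assumed GPT-2 124M shape: maxT = 1024, C = 768, bf16 (2 bytes per element).
n_params, n_bytes = wpe_overhead(1024, 768, 2)
print(n_params, n_bytes / 2**20)  # 786432 elements, 1.5 MiB
```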

Tests:
I ran an A/B experiment: trained a 124M GPT-2 on 10B tokens (FineWeb subset) with:
a) learnable positional embeddings (default, -er == 0)
b) RoPE (-er == 1)
c) no positional embedding at all
All other settings were the same.

Results:

Conclusions:

The validation loss is significantly better with RoPE.
The RoPE implementation slightly decreased throughput (consistent with what the EleutherAI folks observed): I observed a drop from ~1_632_000 to ~1_603_000 tok/s (~1.7% perf hit).

ademeure · 2024-09-13T23:04:49Z

LGTM. We could reduce the number of GPU instructions in the kernels a bit by making use_rope known at compile time, but it's debatable whether that's worth the small compile-time increase from having more kernel versions, given that the kernel should be DRAM-limited. See: https://godbolt.org/z/es6GzeePq

(it turns out the bigger optimisation by far is using a 3D grid to get rid of modulus/division/etc. for btc calculation, which is orthogonal and doesn't really belong in this PR)
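As an illustration of the index arithmetic in question, here is a small Python sketch (not the kernel code) comparing the two ways of obtaining `(b, t, c)`. The function names are hypothetical. The first recovers the indices from a flat thread index with division and modulus; the second enumerates them directly, which is roughly what a 3D launch grid would provide through its per-dimension block/thread indices.

```python
def btc_from_flat(idx, T, C):
    """Recover (b, t, c) from a flat index using division and modulus."""
    b = idx // (T * C)
    t = (idx // C) % T
    c = idx % C
    return b, t, c

def btc_from_grid(B, T, C):
    """Enumerate (b, t, c) directly, as a 3D grid would supply them."""
    for b in range(B):
        for t in range(T):
            for c in range(C):
                yield b, t, c
```

Both produce the same sequence of index triples; the second simply avoids the integer division and modulus per element.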

gordicaleksa added 15 commits July 28, 2024 12:51

Add RoPE - support on cmdline

242566b

Add RoPE init kernel

a36bc62

Use float buffer for rope freqs; tested against python ref

c08540d

Do not use WPE when RoPE enabled

61a0376

Add initial RoPE kernel

46babfe

Reduce freqs table 2x

e90062c

Use x128 loads for RoPE fwd kernel

8516330

Implement rope bwd kernel

0f27b28

Change default rope value

96222c6

Minor refactor + fix fwd enc bug

841e229

Remove wpe grad communication

3fda17b

Bug fix: missing /2 in freq table in the kernel

7e0c497

Merge branch 'llama3' into add_rope

03185cf

Move rope changes to llama 3 file

b2a30c6

Add new line

2fc77cc
