gradient accumulation #787
base: main
Conversation
Force-pushed from c93b2a6 to 03942bb.
Double
Have you tested that this gives the same loss as a large batch? E.g., you could run locally on just a v4-8 with per_device_batch_size=1, gradient_accumulation_steps=5 for 100 steps, and then per_device_batch_size=5, gradient_accumulation_steps=1 for 100 steps (the two should give near-identical loss).
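For intuition, here is that check in miniature on a toy linear model. This is only a sketch: the model, loss, and data below are stand-ins, not this project's actual code; it just mirrors the 1×5-vs-5×1 comparison suggested above.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model with MSE loss, averaged over the batch.
    return jnp.mean((x @ params - y) ** 2)

grad_fn = jax.grad(loss_fn)

params = jax.random.normal(jax.random.PRNGKey(0), (4,))
x = jax.random.normal(jax.random.PRNGKey(1), (5, 4))  # "large batch" of 5
y = jax.random.normal(jax.random.PRNGKey(2), (5,))

# Analogue of per_device_batch_size=5, gradient_accumulation_steps=1.
full_grad = grad_fn(params, x, y)

# Analogue of per_device_batch_size=1, gradient_accumulation_steps=5:
# sum the per-microbatch gradients, then average over the 5 steps.
acc = jnp.zeros_like(params)
for i in range(5):
    acc = acc + grad_fn(params, x[i:i + 1], y[i:i + 1])
avg_grad = acc / 5

# Should agree up to floating-point error.
print(jnp.max(jnp.abs(full_grad - avg_grad)))
```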
Agree with the test above to verify the result.
Also trying to learn more here. After some research, it seems we should do the following (see the sketch after this list):
- Accumulated Gradients += Gradients (from the current accumulation step)
- Averaged Gradients = Accumulated Gradients / number of accumulation steps (i.e., divide once at the end, rather than keeping a running average or computing gradients only on the last microbatch)
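A minimal JAX sketch of that scheme, assuming `params` is a pytree and `loss_fn(params, batch)` is a placeholder loss; none of these names are from the PR itself:

```python
import jax
import jax.numpy as jnp

def averaged_gradients(loss_fn, params, microbatches):
    grad_fn = jax.grad(loss_fn)
    # Accumulated gradients start at zero, one buffer per parameter.
    acc = jax.tree_util.tree_map(jnp.zeros_like, params)
    for mb in microbatches:
        # Accumulated Gradients += Gradients (from the current accumulation step).
        acc = jax.tree_util.tree_map(jnp.add, acc, grad_fn(params, mb))
    # Averaged Gradients = Accumulated Gradients / number of accumulation steps;
    # divide once at the end instead of averaging along the way.
    return jax.tree_util.tree_map(lambda g: g / len(microbatches), acc)
```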
Force-pushed from b39e98b to d4d99e3.
Adding a new feature, gradient accumulation, to update the weights only once every x steps (see the training-loop sketch below).

Example command without gradient accumulation:

Example command with gradient accumulation:

Result1

Result2
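To illustrate the "update only every x steps" behavior, here is a sketch of the outer training loop, assuming an optax optimizer; `loss_fn`, `batches`, and every other name here is hypothetical, not this PR's actual code:

```python
import jax
import jax.numpy as jnp
import optax

def train(loss_fn, params, batches, accum_steps, learning_rate=1e-3):
    opt = optax.sgd(learning_rate)
    opt_state = opt.init(params)
    acc = jax.tree_util.tree_map(jnp.zeros_like, params)
    for step, batch in enumerate(batches, start=1):
        # Accumulate gradients on every microbatch.
        grads = jax.grad(loss_fn)(params, batch)
        acc = jax.tree_util.tree_map(jnp.add, acc, grads)
        # Only apply a weight update once every accum_steps microbatches.
        if step % accum_steps == 0:
            avg = jax.tree_util.tree_map(lambda g: g / accum_steps, acc)
            updates, opt_state = opt.update(avg, opt_state, params)
            params = optax.apply_updates(params, updates)
            # Reset the accumulator for the next group of microbatches.
            acc = jax.tree_util.tree_map(jnp.zeros_like, params)
    return params
```

As an off-the-shelf alternative to hand-rolling the accumulator, optax also provides `optax.MultiSteps`, which wraps an optimizer so that updates are applied only every k steps.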