gradient accumulation #787
base: main
Conversation
Force-pushed from c93b2a6 to 03942bb.
Double
Have you tested that this gives the same loss as a large batch? E.g., you could run locally on just a v4-8 with per_device_batch_size=1, gradient_accumulation_steps=5 for 100 steps, and then per_device_batch_size=5, gradient_accumulation_steps=1 for 100 steps (the two should give near-identical loss).
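For intuition, here is that check in miniature on a toy linear model. This is only a sketch: the model, loss, and data below are stand-ins, not this project's actual code; it just mirrors the 1×5-vs-5×1 comparison suggested above.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model with MSE loss, averaged over the batch.
    return jnp.mean((x @ params - y) ** 2)

grad_fn = jax.grad(loss_fn)

params = jax.random.normal(jax.random.PRNGKey(0), (4,))
x = jax.random.normal(jax.random.PRNGKey(1), (5, 4))  # "large batch" of 5
y = jax.random.normal(jax.random.PRNGKey(2), (5,))

# Analogue of per_device_batch_size=5, gradient_accumulation_steps=1.
full_grad = grad_fn(params, x, y)

# Analogue of per_device_batch_size=1, gradient_accumulation_steps=5:
# sum the per-microbatch gradients, then average over the 5 steps.
acc = jnp.zeros_like(params)
for i in range(5):
    acc = acc + grad_fn(params, x[i:i + 1], y[i:i + 1])
avg_grad = acc / 5

# Should agree up to floating-point error.
print(jnp.max(jnp.abs(full_grad - avg_grad)))
```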
Agree with the test above to verify the result.
Also trying to learn more here. After some research, it seems we should do the following (see the sketch after this list):
- Accumulated Gradients += Gradients (from the current accumulation step)
- Averaged Gradients = Accumulated Gradients / number of accumulation steps (i.e., divide once at the end, rather than keeping a running average or computing gradients only on the last microbatch)
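A minimal JAX sketch of that scheme, assuming `params` is a pytree and `loss_fn(params, batch)` is a placeholder loss; none of these names are from the PR itself:

```python
import jax
import jax.numpy as jnp

def averaged_gradients(loss_fn, params, microbatches):
    grad_fn = jax.grad(loss_fn)
    # Accumulated gradients start at zero, one buffer per parameter.
    acc = jax.tree_util.tree_map(jnp.zeros_like, params)
    for mb in microbatches:
        # Accumulated Gradients += Gradients (from the current accumulation step).
        acc = jax.tree_util.tree_map(jnp.add, acc, grad_fn(params, mb))
    # Averaged Gradients = Accumulated Gradients / number of accumulation steps;
    # divide once at the end instead of averaging along the way.
    return jax.tree_util.tree_map(lambda g: g / len(microbatches), acc)
```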
Force-pushed from b39e98b to d4d99e3.
Adding a new feature, gradient accumulation, to update the weights only once every x steps (see the training-loop sketch below).

Example command without gradient accumulation:

Example command with gradient accumulation:

Result1

Result2
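To illustrate the "update only every x steps" behavior, here is a sketch of the outer training loop, assuming an optax optimizer; `loss_fn`, `batches`, and every other name here is hypothetical, not this PR's actual code:

```python
import jax
import jax.numpy as jnp
import optax

def train(loss_fn, params, batches, accum_steps, learning_rate=1e-3):
    opt = optax.sgd(learning_rate)
    opt_state = opt.init(params)
    acc = jax.tree_util.tree_map(jnp.zeros_like, params)
    for step, batch in enumerate(batches, start=1):
        # Accumulate gradients on every microbatch.
        grads = jax.grad(loss_fn)(params, batch)
        acc = jax.tree_util.tree_map(jnp.add, acc, grads)
        # Only apply a weight update once every accum_steps microbatches.
        if step % accum_steps == 0:
            avg = jax.tree_util.tree_map(lambda g: g / accum_steps, acc)
            updates, opt_state = opt.update(avg, opt_state, params)
            params = optax.apply_updates(params, updates)
            # Reset the accumulator for the next group of microbatches.
            acc = jax.tree_util.tree_map(jnp.zeros_like, params)
    return params
```

As an off-the-shelf alternative to hand-rolling the accumulator, optax also provides `optax.MultiSteps`, which wraps an optimizer so that updates are applied only every k steps.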