Memory saving only in checkpoint size, not during training #8
Hi! Thanks for your interest and your question! We are not sure why there is a reduction in checkpoint size but no memory reduction during training; this seems unusual. For GPT2-1.5B (without weight_tying), there should be a ~5 GB memory reduction per GPU during training. Here are our suggestions.
Please feel free to update your findings here!
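As a rough sanity check (an estimate added here for context, not taken from the thread): under plain DDP every rank holds the full optimizer state, AdamW keeps two fp32 moment tensors (m and v) per parameter, and Adam-mini shrinks the second-moment tensor v to roughly one scalar per parameter block, so the expected per-GPU saving is about one fp32 copy of the model:

```python
# Back-of-the-envelope estimate of the expected per-GPU optimizer-state saving
# for GPT2-1.5B (GPT2-XL) under plain DDP. Numbers are approximate.
n_params = 1.56e9        # approx. parameter count of GPT2-1.5B
bytes_fp32 = 4

adamw_state = 2 * n_params * bytes_fp32       # first moment m + second moment v
adam_mini_state = 1 * n_params * bytes_fp32   # m kept full-size; v reduced to per-block scalars

saving_gib = (adamw_state - adam_mini_state) / 1024**3
print(f"expected per-GPU saving: ~{saving_gib:.1f} GiB")   # ~5.8 GiB, consistent with the ~5 GB above
```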
The first 4 GPUs are using Adam-mini and the rest are using AdamW. Interestingly, switching to AdamW on your codebase runs into an OOM error, but the original nanoGPT runs fine. As can be seen in the attached figure, one GPU uses more memory than the others with Adam-mini; I am not sure why. This is the original GPT2 XL model without weight tying. Batch size and gradient accumulation steps are the same as in my original post (12 and 336). There is no CPU offloading.
Hi @aditya2331! Thanks for the update! Regarding your figure: it seems there is only a ~300 MB memory reduction on each GPU. This is unexpected, because normally it should be a ~5 GB reduction in your setting. Could you double-check your checkpoint size to see if there is actually a ~5 GB reduction? Regarding your comment "adamw on your codebase runs into an OOM error, but using the original nanoGPT runs fine": this also seems unusual, since our code is essentially nanoGPT. Perhaps a simple way to debug is to import and run Adam-mini on the original nanoGPT code to see if things get better. Please feel free to further update here!
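For reference, a minimal sketch of that debugging step, i.e. swapping nanoGPT's `configure_optimizers` call for Adam-mini inside the stock nanoGPT train.py. The import path and constructor arguments follow the Adam-mini README but may differ across versions, and `model`, `learning_rate`, `beta1`, `beta2`, and `weight_decay` are the existing names in nanoGPT's training script:

```python
# Sketch: replace `optimizer = model.configure_optimizers(...)` in nanoGPT's
# train.py with Adam-mini. Argument names are illustrative; check the
# Adam-mini README for the exact signature of your installed version.
from adam_mini import Adam_mini

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=learning_rate,
    betas=(beta1, beta2),
    weight_decay=weight_decay,
    dim=model.config.n_embd,      # hidden size of the GPT-2 model
    n_heads=model.config.n_head,  # number of attention heads
)

# The rest of the training loop is unchanged:
#   loss.backward(); optimizer.step(); optimizer.zero_grad(set_to_none=True)
```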
Same issue with the Hugging Face Trainer.
Hi @980202006! This issue seems unusual; we do observe substantial memory reduction with the Hugging Face Trainer. We share our results as follows. Setting: we use the Hugging Face Trainer under the LLaMA-Factory codebase and run SFT on Llama2-7B with gradient checkpointing, batch size = 4, and DeepSpeed ZeRO-3. The memory usage is shown in the figure attached to the original comment.
Could you share more training details on your side? It would help us debug. Thanks a lot!
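For anyone trying to reproduce a Hugging Face Trainer setup outside LLaMA-Factory, here is a minimal sketch; the Trainer's `optimizers` argument is real, but the Adam-mini constructor arguments are illustrative, the ZeRO-3 JSON path is hypothetical, and `model` / `train_dataset` are assumed to be defined elsewhere:

```python
# Minimal sketch (not the LLaMA-Factory integration itself) of handing Adam-mini
# to a Hugging Face Trainer. `model` and `train_dataset` are assumed to exist;
# "ds_zero3.json" is a hypothetical DeepSpeed ZeRO-3 config file.
from transformers import Trainer, TrainingArguments
from adam_mini import Adam_mini

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=2e-5,
    betas=(0.9, 0.999),
    weight_decay=0.0,
    dim=model.config.hidden_size,
    n_heads=model.config.num_attention_heads,
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_checkpointing=True,
    deepspeed="ds_zero3.json",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    optimizers=(optimizer, None),   # (optimizer, lr_scheduler); scheduler falls back to the default
)
trainer.train()
```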
@zyushun Thank you! I will try again. |
Hi @aditya2331, regarding "switching to adamw on your codebase runs into an OOM error, but using the original nanoGPT runs fine": I found it might be due to a difference in the attention implementation between our model.py (which is an older version from nanoGPT) and the latest model.py in nanoGPT. The difference lies in CausalSelfAttention: the old version uses separate Q, K, V projections, while the latest one combines them into a single fused projection. Mathematically, the two implementations are equivalent, but computationally they may lead to different memory consumption. We recommend using our model.py to avoid any potential unexpected errors.
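For illustration, here is a simplified sketch (paraphrased, not the actual nanoGPT code) of the two CausalSelfAttention projection layouts being compared; the attention math itself is omitted:

```python
import torch.nn as nn

class SeparateQKVAttention(nn.Module):
    """Older-style layout: three separate Q, K, V projections."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.q_proj = nn.Linear(n_embd, n_embd)
        self.k_proj = nn.Linear(n_embd, n_embd)
        self.v_proj = nn.Linear(n_embd, n_embd)

    def qkv(self, x):
        return self.q_proj(x), self.k_proj(x), self.v_proj(x)


class FusedQKVAttention(nn.Module):
    """Latest nanoGPT-style layout: one fused projection split into Q, K, V."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)

    def qkv(self, x):
        return self.c_attn(x).split(x.size(-1), dim=-1)
```

Both layouts compute the same Q, K, V, but the fused version allocates one large projection instead of three smaller ones, which can shift peak activation and gradient memory enough to matter near the OOM boundary.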
Hi, I tried running the normal train_gpt2.py code with Adam_mini. I had to remove the ValueError and the Hessian spectrum import to make it work properly. I noticed a good reduction in checkpoint size but no memory reduction during training, so I couldn't fit a larger batch size for higher throughput. The memory usage during training was the same as with regular Adam.
Training Details:
DDP with 4xA100 80GB GPUs, batch size 12, 336 gradient accumulation steps, GPT2 XL model with weight tying removed
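One generic way to check whether the optimizer state is actually smaller during training (a PyTorch measurement sketch added here, not something from the thread) is to compare peak allocated memory right after the first optimizer step, once the optimizer state has been materialized; `model`, `optimizer`, and the batch `(X, Y)` are assumed to come from the surrounding nanoGPT-style training script:

```python
import torch

# Compare this number between an AdamW run and an Adam-mini run with identical
# model config, batch size, and gradient accumulation settings.
torch.cuda.reset_peak_memory_stats()

logits, loss = model(X, Y)          # nanoGPT-style forward returning (logits, loss)
loss.backward()
optimizer.step()                    # optimizer state is allocated on the first step
optimizer.zero_grad(set_to_none=True)

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated after first step: {peak_gib:.2f} GiB")
```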