RuntimeError: Stop_waiting response is expected
This error indicates that the problem is on PyTorch's side. Please make sure your environment is set up correctly (PyTorch version, CUDA) and re-run.
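To make the environment check concrete, here is a minimal sketch that prints the PyTorch version, the CUDA version PyTorch was built against, and the number of visible GPUs. The function name `collect_env_report` is made up for this example, and the script only reports what it finds; it does not diagnose the error.

```python
import importlib
import platform


def collect_env_report():
    """Return a dict describing the local Python / PyTorch / CUDA setup."""
    report = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return report
    report["torch"] = torch.__version__
    # CUDA version PyTorch was compiled with; None for CPU-only builds.
    report["cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Run it inside the same virtual environment that launches the training, and paste the output into the issue so the versions can be compared with what ColossalAI expects.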
If I run any of the examples or training setups from the applications in ColossalAI, I get this issue.
Can you help me solve it?
RuntimeError: Stop_waiting response is expected
Error: failed to run torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr= --master_port= vit_train_demo.py --model_name_or_path google/vit-base-patch16-224 --output_path ./home/jovyan/malika/ColossalAI/examples/images/vit --plugin hybrid_parallel --batch_size 8 --tp_size 4 --pp_size 1 --num_epoch 3 --learning_rate 2e-4 --weight_decay 0.05 --warmup_ratio 0.3 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /home/jovyan/malika/ColossalAI/examples/images/vit && export SHELL="/bin/bash" COLORTERM="truecolor" TERM_PROGRAM_VERSION="1.90.2" LC_ADDRESS="ko_KR.UTF-8" LC_NAME="ko_KR.UTF-8" LC_MONETARY="ko_KR.UTF-8" PWD="/home/jovyan/malika/ColossalAI/examples/images/vit" LOGNAME="jovyan" NCCL_DEBUG="INFO" VSCODE_GIT_ASKPASS_NODE="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/node" MOTD_SHOWN="pam" HOME="/home/jovyan" LANG="ko_KR.UTF-8" LC_PAPER="ko_KR.UTF-8" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:" VIRTUAL_ENV="/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8" 
SSL_CERT_DIR="/usr/lib/ssl/certs" GIT_ASKPASS="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass.sh" SSH_CONNECTION="10.0.0.137 60450 10.0.31.75 22" CUDA_VISIBLE_DEVICES="0,1,2,3" LC_IDENTIFICATION="ko_KR.UTF-8" TERM="xterm-256color" USER="jovyan" VISIBLE="now" VSCODE_GIT_IPC_HANDLE="/tmp/vscode-git-044e287697.sock" SHLVL="2" LC_TELEPHONE="ko_KR.UTF-8" LC_MESSAGES="ko_KR.UTF-8" LC_MEASUREMENT="ko_KR.UTF-8" VIRTUAL_ENV_PROMPT="(torch2.3.0-py3.10-cuda11.8) " LD_LIBRARY_PATH="/usr/lib/nvidia:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64" LC_CTYPE="ko_KR.UTF-8" SSL_CERT_FILE="/usr/lib/ssl/certs/ca-certificates.crt" SSH_CLIENT="10.0.0.137 60450 22" LC_TIME="ko_KR.UTF-8" OMP_NUM_THREADS="1" VSCODE_GIT_ASKPASS_MAIN="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass-main.js" CUDA_HOME="/usr/local/cuda" LC_COLLATE="ko_KR.UTF-8" GCC_COLORS="error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01" BROWSER="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/helpers/browser.sh" PATH="/usr/local/cuda/bin:/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin:/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/remote-cli:/usr/local/cuda/bin:/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" LC_NUMERIC="ko_KR.UTF-8" OLDPWD="/home/jovyan/malika/ColossalAI/examples/images" TERM_PROGRAM="vscode" VSCODE_IPC_HOOK_CLI="/tmp/vscode-ipc-3dbaed2b-9e4b-4150-855d-699020003867.sock" _="/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin/colossalai" && torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=210.125.69.5 --master_port=32309 vit_train_demo.py --model_name_or_path google/vit-base-patch16-224 
--output_path ./home/jovyan/malika/ColossalAI/examples/images/vit --plugin hybrid_parallel --batch_size 8 --tp_size 4 --pp_size 1 --num_epoch 3 --learning_rate 2e-4 --weight_decay 0.05 --warmup_ratio 0.3'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes =====
127.0.0.1: failure
====== Stopping All Nodes =====
127.0.0.1: finish
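The log above does not say why the launch failed. One cause that is cheap to rule out, and only an assumption here, is that the rendezvous port (`--master_port=32309` in the command above) was already in use on the machine. A minimal sketch using only the Python standard library; `port_is_free` is a hypothetical helper, not part of torchrun or ColossalAI:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    # 32309 is the master port that appears in the failing command.
    print("port 32309 free:", port_is_free(32309))
```

If the port turns out to be busy, relaunching with a different `--master_port` value is a quick experiment. If it is free, the port is probably not the culprit and the environment versions are the next thing to check.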