[BUG]: error ColossalAI 0.4.0/0.4.2 /usr/bin/supervisord #6032

Open
Storm0921 opened this issue Aug 23, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Storm0921

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

When I started training, the following error appeared. What should I do? Ubuntu 22.04.

(llama) root@autodl-container-ad594b8360-ad0d4c6e:~/autodl-tmp/ColossalAI/applications/ColossalChat/examples/training_scripts# bash train_sft.sh
GPU Memory Usage:
     0  1 MiB
     1  1 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=0,1
/root/miniconda3/envs/llama/bin/colossalai
/root/miniconda3/envs/llama/bin/python
/bin/bash: line 1: export: `=/usr/bin/supervisord': not a valid identifier
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=31312 train_sft.py --pretrain /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct --tokenizer_dir /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct --save_interval 5 --dataset /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00000 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00001 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00002 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00003 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00004 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00005 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00006 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00007 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00008 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00009 --plugin zero2 --batch_size 2 --max_epochs 1 --accumulation_steps 4 --lr 5e-5 --max_len 4096 --grad_checkpoint --save_path /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31 --config_file /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31.json --log_dir /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31 on localhost, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/training_scripts && export    ="/usr/bin/supervisord" SHELL="/bin/bash" NV_LIBCUBLAS_VERSION="12.1.0.26-1" NVIDIA_VISIBLE_DEVICES="GPU-f80c38c0-bbe5-0581-78f3-c584fea4b8c2,GPU-d2d56e77-1eaf-3254-c719-013dea3447cd" NV_NVML_DEV_VERSION="12.1.55-1" NV_CUDNN_PACKAGE_NAME="libcudnn8" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.17.1-1+cuda12.1" CONDA_EXE="/root/miniconda3/bin/conda" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.17.1-1" HOSTNAME="autodl-container-ad594b8360-ad0d4c6e" NVIDIA_REQUIRE_CUDA="cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-12-1=12.1.0.26-1" NV_NVTX_VERSION="12.1.66-1" NV_CUDA_CUDART_DEV_VERSION="12.1.55-1" NV_LIBCUSPARSE_VERSION="12.0.2.55-1" NV_LIBNPP_VERSION="12.0.2.50-1" NCCL_VERSION="2.17.1-1" PWD="/root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/training_scripts" AutoDLContainerUUID="ad594b8360-ad0d4c6e" NV_CUDNN_PACKAGE="libcudnn8=8.9.0.131-1+cuda12.1" CONDA_PREFIX="/root/miniconda3/envs/llama" NVIDIA_DRIVER_CAPABILITIES="compute,utility,graphics,video" JUPYTER_SERVER_URL="http://autodl-container-ad594b8360-ad0d4c6e:8888/jupyter/" NV_NVPROF_DEV_PACKAGE="cuda-nvprof-12-1=12.1.55-1" NV_LIBNPP_PACKAGE="libnpp-12-1=12.0.2.50-1" NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" TZ="Asia/Shanghai" NV_LIBCUBLAS_DEV_VERSION="12.1.0.26-1" NVIDIA_PRODUCT_NAME="CUDA" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-12-1" LINES="57" NV_CUDA_CUDART_VERSION="12.1.55-1" AutoDLServiceURL="https://u258683-8360-ad0d4c6e.westc.gpuhub.com:8443" HOME="/root" 
LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" COLUMNS="221" NVIDIA_CUDA_END_OF_LIFE="1" AutoDLRegion="west-C" CUDA_VERSION="12.1.0" AgentHost="172.21.0.184" NV_LIBCUBLAS_PACKAGE="libcublas-12-1=12.1.0.26-1" NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE="cuda-nsight-compute-12-1=12.1.0-1" CONDA_PROMPT_MODIFIER="(llama) " PYDEVD_USE_FRAME_EVAL="NO" PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION="python" NV_LIBNPP_DEV_PACKAGE="libnpp-dev-12-1=12.0.2.50-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-12-1" NV_LIBNPP_DEV_VERSION="12.0.2.50-1" CUDA_VISIBLE_DEVICES="0,1" JUPYTER_SERVER_ROOT="/root" TERM="xterm-256color" NV_LIBCUSPARSE_DEV_VERSION="12.0.2.55-1" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="8.9.0.131" AutodlAutoPanelToken="jupyter-autodl-container-ad594b8360-ad0d4c6e-3db8691ef93a642afa031eb431dfa54b2de5d9a5f90f44fa69fbd6d8ab6c8ef2b" CONDA_SHLVL="2" SHLVL="3" PYXTERM_DIMENSIONS="80x25" NV_CUDA_LIB_VERSION="12.1.0-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.9.0.131-1+cuda12.1" NV_CUDA_COMPAT_PACKAGE="cuda-compat-12-1" CONDA_PYTHON_EXE="/root/miniconda3/bin/python" NV_LIBNCCL_PACKAGE="libnccl2=2.17.1-1+cuda12.1" LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64" LC_CTYPE="C.UTF-8" CONDA_DEFAULT_ENV="llama" NV_CUDA_NSIGHT_COMPUTE_VERSION="12.1.0-1" REQUESTS_CA_BUNDLE="/etc/ssl/certs/ca-certificates.crt" OMP_NUM_THREADS="32" NV_NVPROF_VERSION="12.1.55-1" PATH="/root/miniconda3/envs/llama/bin:/root/miniconda3/condabin:/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.17.1-1" MKL_NUM_THREADS="32" CONDA_PREFIX_1="/root/miniconda3" DEBIAN_FRONTEND="noninteractive" OLDPWD="/root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts" _="/root/miniconda3/envs/llama/bin/colossalai" && torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=31312 train_sft.py --pretrain /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct 
--tokenizer_dir /root/autodl-tmp/model/qwen/Qwen2-1___5B-Instruct --save_interval 5 --dataset /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00000 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00001 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00002 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00003 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00004 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00005 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00006 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00007 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00008 /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/data_preparation_scripts/arrow/part-00009 --plugin zero2 --batch_size 2 --max_epochs 1 --accumulation_steps 4 --lr 5e-5 --max_len 4096 --grad_checkpoint --save_path /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31 --config_file /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31.json --log_dir /root/autodl-tmp/ColossalAI/applications/ColossalChat/examples/logsSFT-2024-08-23-19-13-31'

Exit code: 1

Stdout: already printed

Stderr: already printed



====== Training on All Nodes =====
localhost: failure

====== Stopping All Nodes =====
localhost: finish
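
The failing piece of the generated command above is export    ="/usr/bin/supervisord": the launcher appears to re-export every entry of the calling environment, and one of those entries has an empty variable name, which bash rejects. A minimal sketch for checking whether the current shell carries such an entry (assumes a Linux system where /proc is readable; a multi-line value could cause a false positive):

# Raw environment entries are NUL-separated in /proc/self/environ; an entry whose
# name is empty starts with '=' and is what becomes export ="/usr/bin/supervisord".
cat /proc/self/environ | tr '\0' '\n' | grep -n '^=' \
  || echo "no empty-named environment entries found"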

Environment

No response

Storm0921 added the bug label Aug 23, 2024
@wangbluo
Contributor

Please check your environment. The error message '/bin/bash: line 1: export: `=/usr/bin/supervisord': not a valid identifier'
indicates that there was an attempt to export an environment variable without specifying a variable name. Please check your .bashrc, or run source ~/.bashrc again.
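
A minimal sketch: the same bash message can be reproduced directly, and one possible (untested) workaround is to start the script from a cleaned environment so the launcher never sees the malformed entry:

# Exporting an assignment whose name part is empty is rejected by bash with
# roughly the same message as in the log above:
bash -c 'export ="/usr/bin/supervisord"'
# -> bash: line 1: export: `=/usr/bin/supervisord': not a valid identifier

# Hypothetical workaround: re-run the script under a minimal environment.
# Only HOME and PATH are restored in this sketch; conda-, CUDA- and NCCL-related
# variables may need to be re-added for a real training run.
env -i HOME="$HOME" PATH="$PATH" bash train_sft.sh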

@Storm0921
Author

Please check your environment. The error message '/bin/bash: line 1: export: `=/usr/bin/supervisord': not a valid identifier' indicates that there was an attempt to export an environment variable without specifying a variable name. Please check your .bashrc, or run source ~/.bashrc again.

Thanks for your reply. But I ran it according to the official example, and I don't know how to fix this error. Can you give me some guidance? I checked the ~/.bashrc file and it was fine.
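
Since bash itself refuses to create a variable with an empty name, a line in ~/.bashrc is an unlikely source; a more plausible origin is the process that started the shell (for example the container's supervisor), whose environment every child inherits. A sketch for checking that, assuming PID 1 is the supervisor process and /proc is readable as root:

# Look for an empty-named entry in the environment of PID 1; if it is present there,
# every shell and the colossalai launcher will inherit it.
cat /proc/1/environ | tr '\0' '\n' | grep -n '^='
# If the entry shows up, launching the training script through a cleaned environment
# (as sketched above) avoids passing it on to the generated torchrun command.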
