We are currently working with the LLaMA model in Megatron (specifically Megatron-LLaMA). Pretraining LLaMA on a single node with 8 GPUs using the torchrun command works perfectly fine, but when we try to run multi-node training through SLURM we get a TypeError: perform_nocolor_split(): incompatible function arguments.
We are using the PyTorch NGC container (version 24.01).
TP_SIZE=1
PP_SIZE=1
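To compare what each rank actually sees under torchrun versus our SLURM launch, we have been using a small diagnostic along these lines (a sketch of our own, not part of Megatron-LLaMA; the SLURM_* names are the variables our sbatch wrapper derives the torch.distributed env:// rendezvous values from):

```python
# debug_env.py -- run with the same launcher as pretrain_llama.py on each setup
# and diff the output per rank. A gap in RANK/LOCAL_RANK/WORLD_SIZE/MASTER_*
# usually means the SLURM launch script is not exporting them correctly.
import os
import torch

keys = [
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT",
    "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS",
]

print({k: os.environ.get(k) for k in keys},
      "visible_gpus:", torch.cuda.device_count())
```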
3: [rank3]: Traceback (most recent call last):
3: [rank3]: File "/workspace/pretrain_llama.py", line 118, in
3: [rank3]: pretrain(train_valid_test_datasets_provider, model_provider,
3: [rank3]: File "/workspace/megatron/training.py", line 90, in pretrain
3: [rank3]: initialize_megatron(extra_args_provider=extra_args_provider,
3: [rank3]: File "/workspace/megatron/initialize.py", line 76, in initialize_megatron
3: [rank3]: finish_mpu_init()
3: [rank3]: File "/workspace/megatron/initialize.py", line 56, in finish_mpu_init
3: [rank3]: _initialize_distributed()
3: [rank3]: File "/workspace/megatron/initialize.py", line 185, in _initialize_distributed
3: [rank3]: mpu.initialize_model_parallel(args.tensor_model_parallel_size,
3: [rank3]: File "/workspace/megatron/core/parallel_state.py", line 160, in initialize_model_parallel
3: [rank3]: group = torch.distributed.new_group(list(ranks))
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 92, in wrapper
3: [rank3]: func_return = func(*args, **kwargs)
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4092, in new_group
3: [rank3]: return _new_group_with_tag(
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4172, in _new_group_with_tag
3: [rank3]: pg, pg_store = _new_process_group_helper(
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1507, in _new_process_group_helper
3: [rank3]: split_from.perform_nocolor_split(_get_default_group().bound_device_id)
3: [rank3]: TypeError: perform_nocolor_split(): incompatible function arguments. The following argument types are supported:
3: [rank3]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL, arg0: torch.device) -> None
3: [rank3]: Invoked with: <torch.distributed.distributed_c10d.ProcessGroupNCCL object at 0x1551a5b06bb0>, None
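Our reading of the trace (an assumption on our part, not confirmed) is that new_group() in this torch build takes the ncclCommSplit path and passes the default group's bound_device_id, which is None because init_process_group() was never given a device_id. A minimal sketch of the change we are considering in megatron/initialize.py::_initialize_distributed, assuming the device_id argument is accepted by the torch 2.2-based build in NGC 24.01:

```python
# Sketch of a possible workaround: bind the default process group to this
# rank's GPU so bound_device_id is not None when new_group() later calls
# perform_nocolor_split(). The env-var names follow the standard torchrun
# conventions; whether this resolves the error on our cluster is untested.
import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

torch.distributed.init_process_group(
    backend="nccl",
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
    device_id=torch.device("cuda", local_rank),  # device_id exists from torch 2.2
)
```

Alternatively, a newer NGC container (with a torch release that guards this split path when no device is bound) might avoid the issue, but we have not verified that either.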