We are currently working with the LLaMA model in Megatron (specifically Megatron-LLaMA). Pretraining LLaMA on a single node with 8 GPUs using the torchrun command works perfectly fine, but when we try to run multi-node training through SLURM we get a TypeError: perform_nocolor_split(): incompatible function arguments.
We are using the PyTorch NGC container (version 24.01).
TP_SIZE=1
PP_SIZE=1
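To compare what each rank actually sees under torchrun versus our SLURM launch, we have been using a small diagnostic along these lines (a sketch of our own, not part of Megatron-LLaMA; the SLURM_* names are the variables our sbatch wrapper derives the torch.distributed env:// rendezvous values from):

```python
# debug_env.py -- run with the same launcher as pretrain_llama.py on each setup
# and diff the output per rank. A gap in RANK/LOCAL_RANK/WORLD_SIZE/MASTER_*
# usually means the SLURM launch script is not exporting them correctly.
import os
import torch

keys = [
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT",
    "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS",
]

print({k: os.environ.get(k) for k in keys},
      "visible_gpus:", torch.cuda.device_count())
```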
3: [rank3]: Traceback (most recent call last):
3: [rank3]: File "/workspace/pretrain_llama.py", line 118, in
3: [rank3]: pretrain(train_valid_test_datasets_provider, model_provider,
3: [rank3]: File "/workspace/megatron/training.py", line 90, in pretrain
3: [rank3]: initialize_megatron(extra_args_provider=extra_args_provider,
3: [rank3]: File "/workspace/megatron/initialize.py", line 76, in initialize_megatron
3: [rank3]: finish_mpu_init()
3: [rank3]: File "/workspace/megatron/initialize.py", line 56, in finish_mpu_init
3: [rank3]: _initialize_distributed()
3: [rank3]: File "/workspace/megatron/initialize.py", line 185, in _initialize_distributed
3: [rank3]: mpu.initialize_model_parallel(args.tensor_model_parallel_size,
3: [rank3]: File "/workspace/megatron/core/parallel_state.py", line 160, in initialize_model_parallel
3: [rank3]: group = torch.distributed.new_group(list(ranks))
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 92, in wrapper
3: [rank3]: func_return = func(*args, **kwargs)
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4092, in new_group
3: [rank3]: return _new_group_with_tag(
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4172, in _new_group_with_tag
3: [rank3]: pg, pg_store = _new_process_group_helper(
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1507, in _new_process_group_helper
3: [rank3]: split_from.perform_nocolor_split(_get_default_group().bound_device_id)
3: [rank3]: TypeError: perform_nocolor_split(): incompatible function arguments. The following argument types are supported:
3: [rank3]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL, arg0: torch.device) -> None
3: [rank3]: Invoked with: <torch.distributed.distributed_c10d.ProcessGroupNCCL object at 0x1551a5b06bb0>, None
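Our reading of the trace (an assumption on our part, not confirmed) is that new_group() in this torch build takes the ncclCommSplit path and passes the default group's bound_device_id, which is None because init_process_group() was never given a device_id. A minimal sketch of the change we are considering in megatron/initialize.py::_initialize_distributed, assuming the device_id argument is accepted by the torch 2.2-based build in NGC 24.01:

```python
# Sketch of a possible workaround: bind the default process group to this
# rank's GPU so bound_device_id is not None when new_group() later calls
# perform_nocolor_split(). The env-var names follow the standard torchrun
# conventions; whether this resolves the error on our cluster is untested.
import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

torch.distributed.init_process_group(
    backend="nccl",
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
    device_id=torch.device("cuda", local_rank),  # device_id exists from torch 2.2
)
```

Alternatively, a newer NGC container (with a torch release that guards this split path when no device is bound) might avoid the issue, but we have not verified that either.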