
TypeError: perform_nocolor_split(): incompatible function arguments. #69

vedantgoswami opened this issue Aug 14, 2024 · 0 comments


We are currently working with the LLaMA model in Megatron (specifically Megatron-LLaMA). Pretraining LLaMA on a single node with 8 GPUs using the torchrun command works perfectly fine, but when we try to run multi-node via Slurm, training fails with TypeError: perform_nocolor_split(): incompatible function arguments.
We are using the PyTorch NGC container (version 24.01).
TP_SIZE=1
PP_SIZE=1

3: [rank3]: Traceback (most recent call last):
3: [rank3]: File "/workspace/pretrain_llama.py", line 118, in
3: [rank3]: pretrain(train_valid_test_datasets_provider, model_provider,
3: [rank3]: File "/workspace/megatron/training.py", line 90, in pretrain
3: [rank3]: initialize_megatron(extra_args_provider=extra_args_provider,
3: [rank3]: File "/workspace/megatron/initialize.py", line 76, in initialize_megatron
3: [rank3]: finish_mpu_init()
3: [rank3]: File "/workspace/megatron/initialize.py", line 56, in finish_mpu_init
3: [rank3]: _initialize_distributed()
3: [rank3]: File "/workspace/megatron/initialize.py", line 185, in _initialize_distributed
3: [rank3]: mpu.initialize_model_parallel(args.tensor_model_parallel_size,
3: [rank3]: File "/workspace/megatron/core/parallel_state.py", line 160, in initialize_model_parallel
3: [rank3]: group = torch.distributed.new_group(list(ranks))
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 92, in wrapper
3: [rank3]: func_return = func(*args, **kwargs)
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4092, in new_group
3: [rank3]: return _new_group_with_tag(
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4172, in _new_group_with_tag
3: [rank3]: pg, pg_store = _new_process_group_helper(
3: [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1507, in _new_process_group_helper
3: [rank3]: split_from.perform_nocolor_split(_get_default_group().bound_device_id)
3: [rank3]: TypeError: perform_nocolor_split(): incompatible function arguments. The following argument types are supported:
3: [rank3]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL, arg0: torch.device) -> None
3: [rank3]: Invoked with: <torch.distributed.distributed_c10d.ProcessGroupNCCL object at 0x1551a5b06bb0>, None
