[Core] ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready. #47183

Open
hxue3 opened this issue Aug 16, 2024 · 7 comments
Labels: bug (Something that is supposed to be working, but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical), usability

Comments

@hxue3

hxue3 commented Aug 16, 2024

What happened + What you expected to happen

I am trying to load a quantized large model with vLLM. Model loading starts, but it sometimes stops partway through and returns the error message ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

Versions / Dependencies

ray: 2.34.0
python: 3.10
OS: ubuntu 22

Reproduction script

# Imports and parameters needed to run the snippet (added for completeness;
# tensor_parallel_size=8 matches the engine config in the logs below, the other
# values are placeholders for definitions that live elsewhere in the original script).
from typing import Any, Dict, List

import numpy as np
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from vllm import LLM, SamplingParams

tensor_parallel_size = 8
num_instances = 1                   # number of LLM replicas (placeholder)
sampling_params = SamplingParams()  # placeholder; the original script configures this
parse_options = None                # placeholder; the original script builds pyarrow CSV parse options


# Create a class to do batch inference.
class LLMPredictor:

    def __init__(self):
        # Create an LLM.
        # ray.shutdown()
        # ray.init(num_gpus=torch.cuda.device_count())
        self.llm = LLM(
            model="/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8",
            tensor_parallel_size=tensor_parallel_size,
            gpu_memory_utilization=0.95,
            max_model_len=32768,
            max_num_batched_tokens=32768,
        )

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # Generate texts from the prompts.
        # The output is a list of RequestOutput objects that contain the prompt,
        # generated text, and other information.
        outputs = self.llm.generate(batch["prompts"], sampling_params)
        prompt: List[str] = []
        generated_text: List[str] = []
        for output in outputs:
            prompt.append(output.prompt)
            generated_text.append(' '.join([o.text for o in output.outputs]))
        return {
            "prompt": prompt,
            "generated_text": generated_text,
        }

ds = ray.data.read_csv("sample_prompts.csv", parse_options=parse_options)


# For tensor_parallel_size > 1, we need to create placement groups for vLLM
# to use. Every actor has to have its own placement group.
def scheduling_strategy_fn():
    # One bundle per tensor parallel worker
    pg = ray.util.placement_group(
        [{
            "GPU": 1,
            "CPU": 1
        }] * tensor_parallel_size,
        strategy="STRICT_PACK",
    )
    return dict(scheduling_strategy=PlacementGroupSchedulingStrategy(
        pg, placement_group_capture_child_tasks=True))


resources_kwarg: Dict[str, Any] = {}
if tensor_parallel_size == 1:
    # For tensor_parallel_size == 1, we simply set num_gpus=1.
    resources_kwarg["num_gpus"] = 1
else:
    # Otherwise, we have to set num_gpus=0 and provide
    # a function that will create a placement group for
    # each instance.
    resources_kwarg["num_gpus"] = 0
    resources_kwarg["ray_remote_args_fn"] = scheduling_strategy_fn


# Apply batch inference for all input data.
ds = ds.map_batches(
    LLMPredictor,
    # Set the concurrency to the number of LLM instances.
    concurrency=num_instances,
    # Specify the batch size for inference.
    batch_size=2,
    **resources_kwarg,
)
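
The traceback further down shows the dataset being materialized with ds.take_all() (run_inference_quantized.py, line 99); assuming the script ends the same way, the final step would be:

# Materialize the dataset; this is the call that triggers actor startup
# and eventually raises the GetTimeoutError shown in the traceback below.
outputs = ds.take_all()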

Issue Severity

High: It blocks me from completing my task.

@hxue3 hxue3 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 16, 2024
@rkooo567 rkooo567 changed the title [<Ray component: Core|RLlib|etc...>] ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready. [Core] ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready. Aug 19, 2024
@rkooo567 rkooo567 added the core Issues that should be addressed in Ray Core label Aug 19, 2024
@rkooo567
Contributor

Can you provide the full stack trace when you see:

I am trying to load a quantized large model with vLLM. Model loading starts, but it sometimes stops partway through and returns the error message ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

@hxue3
Author

hxue3 commented Aug 19, 2024

Can you provide the full stack trace when you see:

I am trying to load a quantized large model with vLLM. Model loading starts, but it sometimes stops partway through and returns the error message ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

Here is the full log:
ray, version 2.34.0
/usr/bin/ssh ray hpcm04r08n03.hpc.ford.com 'ray start --address='19.62.140.84:6379''
ssh-keysign: no matching hostkey found for key ED25519 SHA256:vWBZ07sgIiNdjV05CoWXNtpeMzvzg/mHz2QSX4QWJSw
ssh_keysign: no reply
sign using hostkey ssh-ed25519 SHA256:vWBZ07sgIiNdjV05CoWXNtpeMzvzg/mHz2QSX4QWJSw failed
Permission denied, please try again.
Permission denied, please try again.
hxue3@hpcm04r08n03.hpc.ford.com: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,keyboard-interactive,hostbased).
2024-08-19 17:51:36,012	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/pbs.8474282.hpcq/ray/session_2024-08-19_17-51-20_670517_323/logs/ray-data
2024-08-19 17:51:36,012	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCSV] -> ActorPoolMapOperator[MapBatches(LLMPredictor)]
(_MapWorker pid=7279) INFO 08-19 17:51:40 config.py:484] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(_MapWorker pid=7279) INFO 08-19 17:51:40 config.py:729] Defaulting to use ray for distributed inference
(_MapWorker pid=7279) INFO 08-19 17:51:40 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=25000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
(_MapWorker pid=7279) INFO 08-19 17:51:40 ray_gpu_executor.py:117] use_ray_spmd_worker: False
(_MapWorker pid=7279) INFO 08-19 17:51:40 ray_gpu_executor.py:120] driver_ip: 19.62.140.84
(_MapWorker pid=7279) INFO 08-19 17:52:13 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
(_MapWorker pid=7279) INFO 08-19 17:52:13 selector.py:54] Using XFormers backend.
(_MapWorker pid=7279) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(_MapWorker pid=7279)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(_MapWorker pid=7279)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(_MapWorker pid=7279) INFO 08-19 17:52:22 utils.py:841] Found nccl from library libnccl.so.2
(_MapWorker pid=7279) INFO 08-19 17:52:22 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:13 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:13 selector.py:54] Using XFormers backend. [repeated 7x across cluster]
(_MapWorker pid=7279) INFO 08-19 17:52:29 custom_all_reduce_utils.py:234] reading GPU P2P access cache from /s/hxue3/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:22 utils.py:841] Found nccl from library libnccl.so.2 [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:22 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 7x across cluster]
(_MapWorker pid=7279) INFO 08-19 17:52:29 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fa97c98c880>, local_subscribe_port=60379, remote_subscribe_port=None)
(_MapWorker pid=7279) INFO 08-19 17:52:29 model_runner.py:720] Starting to load model /s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8...
(_MapWorker pid=7279) INFO 08-19 17:52:29 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
(_MapWorker pid=7279) INFO 08-19 17:52:29 selector.py:54] Using XFormers backend.
(_MapWorker pid=7279) Loading safetensors checkpoint shards:   0% Completed | 0/109 [00:00<?, ?it/s]
(RayWorkerWrapper pid=8266) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch. [repeated 15x across cluster]
(RayWorkerWrapper pid=8266)   @torch.library.impl_abstract("xformers_flash::flash_fwd") [repeated 7x across cluster]
(RayWorkerWrapper pid=8266)   @torch.library.impl_abstract("xformers_flash::flash_bwd") [repeated 7x across cluster]
(_MapWorker pid=7279) Loading safetensors checkpoint shards:   1% Completed | 1/109 [00:04<07:43,  4.29s/it]
[... per-shard progress lines for shards 2/109 through 87/109 omitted; loading slows from roughly 4 s/shard to roughly 11 s/shard after shard 45 ...]
(_MapWorker pid=7279) Loading safetensors checkpoint shards:  81% Completed | 88/109 [08:55<03:51, 11.02s/it]
2024-08-19 18:01:36,067	ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2024-08-19 18:01:36,067	ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 135, in start
    ray.get(refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2659, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 848, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 3511, in ray._raylet.CoreWorker.get_objects
  File "python/ray/includes/common.pxi", line 81, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/exceptions.py", line 49, in handle_trace
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/plan.py", line 423, in execute_to_iterator
    bundle_iter = execute_to_legacy_bundle_iterator(executor, self)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 51, in execute_to_legacy_bundle_iterator
    bundle_iter = executor.execute(dag, initial_stats=stats)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor.py", line 114, in execute
    self._topology, _ = build_streaming_topology(dag, self._options)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor_state.py", line 354, in build_streaming_topology
    setup_state(dag)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor_state.py", line 351, in setup_state
    op.start(options)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 137, in start
    raise ray.exceptions.GetTimeoutError(
ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.
ray.data.exceptions.SystemException

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/s/mlsc/hxue3/llama3-vllm/run_inference_quantized.py", line 99, in <module>
    outputs = ds.take_all()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/dataset.py", line 2464, in take_all
    for row in self.iter_rows():
  File "/usr/local/lib/python3.10/dist-packages/ray/data/iterator.py", line 238, in _wrapped_iterator
    for batch in batch_iterable:
  File "/usr/local/lib/python3.10/dist-packages/ray/data/iterator.py", line 155, in _create_iterator
    ) = self._to_ref_bundle_iterator()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/iterator/iterator_impl.py", line 28, in _to_ref_bundle_iterator
    ref_bundles_iterator, stats, executor = ds._plan.execute_to_iterator()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/exceptions.py", line 89, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 custom_all_reduce_utils.py:234] reading GPU P2P access cache from /s/hxue3/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 model_runner.py:720] Starting to load model /s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8... [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache. [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 selector.py:54] Using XFormers backend. [repeated 7x across cluster]

@anyscalesam anyscalesam added data Ray Data-related issues and removed data Ray Data-related issues labels Aug 26, 2024
@anyscalesam
Contributor

anyscalesam commented Aug 26, 2024

@hxue3 this line looks suspect:

ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.

If you try to repro this on a brand new Ray cluster, what does ray status return when it times out?
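
If it is easier to grab the same snapshot from Python instead of the ray status CLI, a minimal sketch using Ray's public resource APIs:

import ray

# Connect to the already-running cluster rather than starting a new one.
ray.init(address="auto", ignore_reinit_error=True)

# Compare what the cluster has registered vs. what is currently free; the
# placement group above needs 8 GPUs and 8 CPUs in one STRICT_PACK bundle set.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())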

@anyscalesam anyscalesam added usability P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 26, 2024
@ggamsso

ggamsso commented Sep 12, 2024

This is an error related to the actor-startup timeout. As the model size increases, the time required to download the model from Hugging Face and to load it into vLLM also increases. You can avoid the GetTimeoutError by increasing the wait_for_min_actors_s value in the DataContext.

import ray
from ray.data import DataContext

# ray init
runtime_env = {"env_vars": {"HF_TOKEN": "__YOUR_HF_TOKEN__"}}
ray.init(runtime_env=runtime_env)

# data context: raise the Ray Data actor-startup timeout (defaults to 10 minutes)
ctx = DataContext.get_current()
ctx.wait_for_min_actors_s = 60 * 10 * tensor_parallel_size

The wait_for_min_actors_s value here is 60 * 10 seconds multiplied by tensor_parallel_size.
With 8 GPUs, that allows up to 80 minutes for downloading the model and starting vLLM.
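
Applied to the reproduction script above, the context change just needs to happen before the dataset is executed; a minimal sketch with the same variable names:

import ray
from ray.data import DataContext

tensor_parallel_size = 8  # same value as the LLM engine config in the logs

ray.init()

# Raise the Ray Data actor-startup timeout before building/executing the
# pipeline, so the actor pool has time to load the 405B FP8 checkpoint.
ctx = DataContext.get_current()
ctx.wait_for_min_actors_s = 60 * 10 * tensor_parallel_size

# ds = ray.data.read_csv(...)
# ds = ds.map_batches(LLMPredictor, concurrency=num_instances, batch_size=2, **resources_kwarg)
# outputs = ds.take_all()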

@anyscalesam
Contributor

cc @hxue3 as FYI - kudos @ggamsso !

@rkooo567
Contributor

@hxue3 lmk if this was fixed!

@nivibilla

nivibilla commented Oct 4, 2024

I'm not the original poster, but this worked for me as well. I was loading a model from S3, and it was taking more than 10 minutes for 70B+ models; increasing the timeout fixed the issue.
Thanks!
