Running Basic Tune Experiments
The most common way to use Tune is also the simplest: as a parallel experiment runner. If you can define experiment trials in a Python function, you can use Tune to run hundreds to thousands of independent trial instances in a cluster. Tune manages trial execution, status reporting, and fault tolerance.
Running Independent Tune Trials in Parallel
As a general example, let’s consider executing N independent model training trials using Tune as a simple grid sweep. Each trial can execute different code depending on a passed-in config dictionary.
Step 1: First, we define the model training function that we want to run variations of. The function takes in a config dictionary as an argument and returns a simple dict output. Learn more about logging Tune results at How to configure logging in Tune?.
from ray import tune
import ray

NUM_MODELS = 100

def train_model(config):
    score = config["model_id"]

    # Import model libraries, etc...
    # Load data and train model code here...

    # Return final stats. You can also return intermediate progress
    # using ray.train.report() if needed.
    # To return your model, you could write it to storage and return its
    # URI in this dict, or return it as a Tune Checkpoint:
    # https://docs.ray.io/en/latest/tune/tutorials/tune-checkpoints.html
    return {"score": score, "other_data": ...}
Step 2: Next, define the space of trials to run. Here, we define a simple grid sweep from 0..NUM_MODELS, which will generate the config dicts to be passed to each model function. Learn more about what features Tune offers for defining spaces at Working with Tune Search Spaces.
# Define trial parameters as a single grid sweep.
trial_space = {
    # This is an example parameter. You could replace it with filesystem paths,
    # model types, or even full nested Python dicts of model configurations, etc.,
    # that enumerate the set of trials to run.
    "model_id": tune.grid_search([
        "model_{}".format(i)
        for i in range(NUM_MODELS)
    ])
}
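Grid search is just one way to define a space. If you want Tune to sample values rather than enumerate them, you can use the sampling primitives from the search space API; a short sketch, where the parameter names are illustrative rather than part of the example above:

# Hypothetical sampled search space; "lr" and "batch_size" are example names.
sampled_space = {
    "lr": tune.loguniform(1e-4, 1e-1),         # sampled log-uniformly per trial
    "batch_size": tune.choice([32, 64, 128]),  # one value chosen per trial
}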
Step 3: Optionally, configure the resources allocated per trial. Tune uses this resource allocation to control parallelism. For example, if each trial is configured to use 4 CPUs and the cluster has only 32 CPUs, then Tune limits the number of concurrent trials to 8 to avoid overloading the cluster. For more information, see A Guide To Parallelism and Resources for Ray Tune.
# Can customize resources per trial, here we set 1 CPU each.
train_model = tune.with_resources(train_model, {"cpu": 1})
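Trials can also request GPUs through the same mechanism. A sketch with illustrative values:

# Hypothetical: give each trial 2 CPUs and 1 GPU. On a cluster with 8 GPUs,
# this caps concurrency at 8 trials.
train_model = tune.with_resources(train_model, {"cpu": 2, "gpu": 1})

If you want to cap concurrency directly rather than through resource requests, tune.TuneConfig(max_concurrent_trials=...) can also be passed to the Tuner via its tune_config argument.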
Step 4: Run the trials with Tune. Tune reports on the experiment status while it runs, and after it finishes, you can inspect the results. Tune can retry failed trials automatically, as well as entire experiments; see How to Define Stopping Criteria for a Ray Tune Experiment.
# Start a Tune run and print the best result.
tuner = tune.Tuner(train_model, param_space=trial_space)
results = tuner.fit()
# Access individual results.
print(results[0])
print(results[1])
print(results[2])
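The returned ResultGrid can also select the best trial directly. Note that the score in this example is a string, so the following sketch assumes trials that report a real-valued metric:

# Assuming trials reported a numeric "score", fetch the best result.
best_result = results.get_best_result(metric="score", mode="max")
print(best_result.metrics)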
Step 5: Inspect the results, which will look something like the output below. Tune periodically prints a status summary to stdout while the experiment runs, until it finishes:
== Status ==
Current time: 2022-09-21 10:19:34 (running for 00:00:04.54)
Memory usage on this node: 6.9/31.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/16.13 GiB heap, 0.0/8.06 GiB objects
Result logdir: /home/ubuntu/ray_results/train_model_2022-09-21_10-19-26
Number of trials: 100/100 (100 TERMINATED)
+-------------------------+------------+----------------------+------------+--------+------------------+
| Trial name | status | loc | model_id | iter | total time (s) |
|-------------------------+------------+----------------------+------------+--------+------------------|
| train_model_8d627_00000 | TERMINATED | 192.168.1.67:2381731 | model_0 | 1 | 8.46386e-05 |
| train_model_8d627_00001 | TERMINATED | 192.168.1.67:2381761 | model_1 | 1 | 0.000126362 |
| train_model_8d627_00002 | TERMINATED | 192.168.1.67:2381763 | model_2 | 1 | 0.000112772 |
...
| train_model_8d627_00097 | TERMINATED | 192.168.1.67:2381731 | model_97 | 1 | 5.57899e-05 |
| train_model_8d627_00098 | TERMINATED | 192.168.1.67:2381767 | model_98 | 1 | 6.05583e-05 |
| train_model_8d627_00099 | TERMINATED | 192.168.1.67:2381763 | model_99 | 1 | 6.69956e-05 |
+-------------------------+------------+----------------------+------------+--------+------------------+
2022-09-21 10:19:35,159 INFO tune.py:762 -- Total run time: 5.06 seconds (4.46 seconds for the tuning loop).
The final result objects contain finished trial metadata:
Result(metrics={'score': 'model_0', 'other_data': Ellipsis, 'done': True, 'trial_id': '8d627_00000', 'experiment_tag': '0_model_id=model_0'}, error=None, log_dir=PosixPath('/home/ubuntu/ray_results/train_model_2022-09-21_10-19-26/train_model_8d627_00000_0_model_id=model_0_2022-09-21_10-19-30'))
Result(metrics={'score': 'model_1', 'other_data': Ellipsis, 'done': True, 'trial_id': '8d627_00001', 'experiment_tag': '1_model_id=model_1'}, error=None, log_dir=PosixPath('/home/ubuntu/ray_results/train_model_2022-09-21_10-19-26/train_model_8d627_00001_1_model_id=model_1_2022-09-21_10-19-31'))
Result(metrics={'score': 'model_2', 'other_data': Ellipsis, 'done': True, 'trial_id': '8d627_00002', 'experiment_tag': '2_model_id=model_2'}, error=None, log_dir=PosixPath('/home/ubuntu/ray_results/train_model_2022-09-21_10-19-26/train_model_8d627_00002_2_model_id=model_2_2022-09-21_10-19-31'))
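For aggregate analysis across all trials, the ResultGrid can be converted to a pandas DataFrame; a brief sketch, assuming pandas is installed:

# Collect the final metrics of every trial into one DataFrame.
df = results.get_dataframe()
print(df[["trial_id", "score"]].head())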
How does Tune compare to using Ray Core (ray.remote)?
You might be wondering how Tune differs from simply using Ray tasks for parallel trial execution. Indeed, the above example could be rewritten as:
# The same sweep, expressed with plain Ray tasks.
remote_train = ray.remote(train_model)
futures = [remote_train.remote({"model_id": i}) for i in range(NUM_MODELS)]
print("Submitting tasks...")
results = ray.get(futures)
print("Trial results", results)
Compared to using Ray tasks, Tune offers the following additional functionality:
Status reporting and tracking, including integrations and callbacks to common monitoring tools.
Checkpointing of trials for fine-grained fault tolerance.
Gang scheduling of multi-worker trials.
In short, consider using Tune if you need status tracking or support for more advanced ML workloads.
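As a concrete example of the fault tolerance mentioned above, trial retries can be enabled through a failure configuration. A minimal sketch, assuming a recent Ray version where RunConfig and FailureConfig live in ray.train (their location has moved across Ray releases):

from ray import tune
from ray.train import RunConfig, FailureConfig

# Hypothetical: retry each failed trial up to 3 times before giving up.
tuner = tune.Tuner(
    train_model,
    param_space=trial_space,
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)
results = tuner.fit()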